A pipelined processor uses pipe control and register translation to maximize instruction flow through,
in an exemplary embodiment, the execution pipelines of a superscalar, superpipelined microprocessor
compatible with the x86 instruction set architecture. The pipe control logic simultaneously issues
instructions into X and Y pipelines without regard to data dependencies between those issued
instructions, and in particular without regard to read-after-write dependencies. Pipe switching
allows instructions to be issued into either execution pipeline so as to minimize stalls due to data
dependencies between instructions. Instruction flow is controlled independently for each pipeline with
data dependencies between instructions being monitored at each stage of a pipeline. A register
translation unit implements register renaming to eliminate write-after-read and write-after-write
dependencies, and data forwarding (both operand forwarding and result forwarding) to eliminate
read-after-write dependencies through the pipelines is controlled independently. The register translation unit
controls the mapping of the eight x86 logical registers into 32 physical registers, including maintaining
for each physical register status as to availability, logical register allocation, and write pending -
allocations of physical registers are based on operand size. Instructions are allowed to proceed to
execution once it has been determined that the instruction can no longer cause an exception,
permitting out-of-order completion. Checkpoint and Exception registers in the register translation unit
support speculative execution, permitting pipeline recovery and repair after an instruction exception,
branch misprediction, or floating point error. A microcontrol unit provides independent microinstruction
flows for both pipelines, enabling, for selected instructions: (a) both execution stages to be used to
execute the same instruction, and (b) control of the register translation unit.
Cross Reference to Related Applications

This application is related to U.S. Ser. No. 08/138,654, (Atty Docket No. CX00186), entitled "Control of
Data for Speculative Execution and Exception Handling in a Microprocessor with Write Buffer" by Garibay et
al, filed concurrently herewith, U.S. Ser. No. 08/138,783, (Atty Docket No. CX00180), entitled "Branch
Processing Unit" to McMahon, filed concurrently herewith, U.S. Ser. No. 08/138,781, (Atty Docket No. CX00181),
entitled "Speculative Execution in a Pipelined Processor" to Bluhm, filed concurrently herewith, and U.S. Ser.
No. 08/138,855, (Atty Docket No. CX00167) to Hervin et al, entitled "Microprocessor Having Single Clock
Instruction Decode Architecture", filed concurrently herewith, all of which are incorporated by reference herein.

Background of the Invention

1. Technical Field

This invention relates in general to microprocessors and more particularly to a pipelined, superscalar
microprocessor architecture.

2. Related Art 

In the design of a microprocessor, instruction throughput, i.e., the number of instructions executed per
second, is of primary importance. The number of instructions executed per second may be increased by various
means. The most straightforward technique for increasing instruction throughput is by increasing the frequency
at which the microprocessor operates. Increased operating frequency, however, is limited by fabrication
techniques and also results in the generation of excess heat.

Thus, modern day microprocessor designs are focusing on increasing the instruction throughput by using
design techniques which increase the average number of instructions executed per clock cycle period. One
such technique for increasing instruction throughput is "pipelining." Pipelining techniques segment each
instruction flowing through the microprocessor into several portions, each of which can be handled by a separate
stage in the pipeline. Pipelining increases the speed of a microprocessor by overlapping multiple instructions
in execution. For example, if each instruction could be executed in six stages, and each stage required one
clock cycle to perform its function, six separate instructions could be simultaneously executed (each executing
in a separate stage of the pipeline) such that one instruction was completed on each clock cycle. In this ideal
scenario, the pipelined architecture would have an instruction throughput which was six times greater than
the non-pipelined architecture, which could complete one instruction every six clock cycles.
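
The six-stage arithmetic above can be checked with a small sketch. It assumes an ideal pipeline with no stalls; the function names are illustrative and do not come from the specification.

```python
def cycles_unpipelined(n_instructions: int, n_stages: int) -> int:
    """Each instruction occupies the whole machine for n_stages clocks."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions: int, n_stages: int) -> int:
    """Ideal pipeline: fill it once, then complete one instruction per clock."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def speedup(n_instructions: int, n_stages: int) -> float:
    """Ratio of unpipelined to pipelined clock counts for the same work."""
    return cycles_unpipelined(n_instructions, n_stages) / cycles_pipelined(
        n_instructions, n_stages
    )


if __name__ == "__main__":
    for n in (1, 10, 1000, 1_000_000):
        print(n, round(speedup(n, 6), 3))
```

For long instruction streams the ratio approaches, but never reaches, the stage count of six, because the pipeline must first be filled.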

A second technique for increasing the speed of a microprocessor is by designing it to be a "superscalar."
In a superscalar architecture, more than one instruction is issued per clock cycle. If no instructions were
dependent upon other instructions in the flow, the increase in instruction throughput would be proportional to
the degree of scalability. Thus, if an architecture were superscalar to degree 2 (meaning that two instructions
issued upon each clock cycle), then the instruction throughput in the machine would double.

A microprocessor may be both superpipelined (an instruction pipeline with many stages is referred to as
"superpipelined") and superscalar to achieve a high instruction throughput. However, the operation of such a
system in practice is far from the ideal situation where each instruction can be neatly executed in a given
number of pipe stages and where the execution of instructions is not interdependent. In actual operation,
instructions have varying resource requirements, thus creating interruptions in the flow of instructions through the
pipeline. Further, the instructions typically have interdependencies; for example, an instruction which reads
the value of a register is dependent on a previous instruction which writes the value to that same register - the
second instruction cannot execute until the first instruction has completed its write to the register.
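
As an illustration of the interdependency just described, the following sketch classifies the register dependencies between a senior (earlier) and a junior (later) instruction. The `Instr` record and the `hazards` function are hypothetical names introduced here; they model the concept only, not the hardware of the embodiment.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one destination, any number of sources."""
    name: str
    dest: Optional[str]
    srcs: Tuple[str, ...] = ()


def hazards(senior: Instr, junior: Instr) -> Set[str]:
    """Return the register dependencies of `junior` on the earlier `senior`."""
    found: Set[str] = set()
    # Read-after-write: junior reads a register that senior writes.
    if senior.dest is not None and senior.dest in junior.srcs:
        found.add("RAW")
    # Write-after-read: junior overwrites a register that senior still reads.
    if junior.dest is not None and junior.dest in senior.srcs:
        found.add("WAR")
    # Write-after-write: both write the same register.
    if senior.dest is not None and senior.dest == junior.dest:
        found.add("WAW")
    return found


if __name__ == "__main__":
    write_eax = Instr("mov eax, 1", "EAX")
    read_eax = Instr("add ebx, eax", "EBX", ("EBX", "EAX"))
    print(hazards(write_eax, read_eax))
```

Only the read-after-write case reflects a true flow of data; the other two arise from reuse of a register name.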

Consequently, while superpipelining and superscalar techniques can increase the throughput of a
microprocessor, the instruction throughput is highly dependent upon the implementation of the superpipelined,
superscalar architecture. One particular problem is controlling the flow of instructions in the pipeline to increase
the instruction throughput without increasing the frequency of the microprocessor. The efficiency of a
superpipelined, superscalar machine is diminished as dependencies, or other factors, cause various stages to be
inactive during operation of the microprocessor.

Therefore, a need has arisen for a microprocessor architecture with efficient control of the flow of
instructions therein.
Summary of the Invention

The invention involves a superscalar, pipelined processor including a plurality of instruction pipelines, each
having a plurality of stages in which instructions issued into a pipeline are processed.

In one aspect of the invention, the processor simultaneously issues instructions into multiple pipelines
without regard to data dependencies between such issued instructions. Pipe control means detects dependencies
between instructions in the pipelines, and controls the flow of instructions through the stages of the pipelines
such that a first instruction in a current stage of one pipeline is not delayed due to a data dependency on a
second instruction in another pipeline unless the data dependency must be resolved for proper processing of
the first instruction in such current stage.

In another aspect of the invention, pipe control means controls instruction flow in the pipelines such that,
for a given stage, a junior instruction cannot modify the processor state before a senior instruction until after
the senior instruction can no longer cause an exception.

In another aspect of the invention, pipe switching means allows selective switching of instructions between
the pipelines, ordering instructions in the pipelines in a sequence to reduce dependencies between
instructions.

In another aspect of the invention, at least two of the pipelines include an execution stage, and a
microcontroller means provides independent microinstruction flows to each execution stage, selectively controlling
each microinstruction flow such that, for selective instructions, the execution stages are independently
controlled to process a single instruction.

In another aspect of the invention, the pipe control means monitors state information for each stage and
controls the flow of instructions between stages responsive to the state information, such that an instruction
in a pipeline can advance from one stage to another independent of the instruction flow in other pipelines.

In another aspect of the invention, the pipe control means eliminates a dependency between first and
second instructions by altering the operand source for one of the instructions.

In another aspect of the invention, the register translation means maintains processor state information
defining a set of physical registers that have been most recently allocated to each of the logical registers in
response to writes to those logical registers. When instructions are issued into a pipeline prior to determining
whether such instructions will cause an exception, at each stage of the pipeline, the state information for the
instruction at that stage is checkpointed such that, if an instruction causes an exception, the corresponding
checkpointed state information is retrieved to restore the processor state to the point of issuing that instruction.
In addition, if a branch or floating point instruction is issued into a pipeline, and subsequent instructions are
then speculatively issued behind such branch or floating point instruction, the state information for such branch
or floating point instruction is checkpointed such that, if the branch is mispredicted or the floating point
instruction faults, the corresponding checkpointed state information is retrieved to restore the processor state to the
point of issuing the branch or floating point instruction.

In another aspect of the invention, the defined set of logical registers have multiple addressable sizes as
sources and destinations of operands for instructions. The register translation means allocates physical
registers to one of said defined set of logical registers responsive to an instruction for writing to said one of said
logical registers and the size associated with the logical register.

In another aspect of the invention, the register translation means stores, for each physical register, a
current indication and logical ID code indicating whether the physical register contains the current value for the
logical register identified by such logical ID code. For each access to a logical register, the corresponding
logical ID code is compared with each logical ID code stored with a physical register, and a corresponding physical
ID code is output for the physical register containing the current value of the associated logical register. In
addition, the register translation unit is capable of storing, for each physical register, status information indicating
whether a data dependency exists for an associated logical register.
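
A minimal software sketch of this lookup follows. It assumes 32 physical registers and eight logical registers as stated in the abstract; the class, its fields, and the free-register policy are hypothetical simplifications (the hardware compares all tags in parallel, and register recycling is omitted).

```python
from typing import List, Optional

LOGICAL = ("EAX", "EBX", "ECX", "EDX", "EDI", "ESI", "EBP", "ESP")


class RegisterTranslationModel:
    """Toy model: each physical register carries a logical ID tag,
    a 'current' bit, and a 'write pending' bit."""

    def __init__(self, n_physical: int = 32) -> None:
        self.logical_id: List[Optional[str]] = [None] * n_physical
        self.current: List[bool] = [False] * n_physical
        self.pending: List[bool] = [False] * n_physical
        # Initially, logical register i lives in physical register i.
        for phys, name in enumerate(LOGICAL):
            self.logical_id[phys] = name
            self.current[phys] = True

    def lookup(self, name: str) -> int:
        """Compare the requested logical ID against every stored tag and
        return the physical ID whose 'current' bit is set."""
        for phys, tag in enumerate(self.logical_id):
            if self.current[phys] and tag == name:
                return phys
        raise KeyError(name)

    def allocate(self, name: str) -> int:
        """Rename on write: take an unused physical register (assumed policy:
        first register never tagged) and make it the current copy."""
        old = self.lookup(name)
        new = self.logical_id.index(None)  # raises ValueError if none is free
        self.current[old] = False
        self.logical_id[new] = name
        self.current[new] = True
        self.pending[new] = True  # value not yet written back
        return new


if __name__ == "__main__":
    rtu = RegisterTranslationModel()
    print(rtu.lookup("EAX"), rtu.allocate("EAX"), rtu.lookup("EAX"))
```

A reader of the register that looks it up after `allocate` sees the new physical ID together with its pending bit, which is the kind of status a dependency check would consult.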

In another aspect of the invention, at least one of the execution pipelines includes an execution unit that
is controlled by a microcontroller means. For selected instructions, the register translation means is also
controlled by the microcontroller means.

Embodiments of the present invention may be implemented to realize one or more of the following
technical advantages. Instructions are issued without regard to dependencies between them so that bubbles are
not introduced into the pipeline unnecessarily, since dependencies may resolve themselves without stalling,
either through normal instruction flow, or through mechanisms in the pipeline to resolve dependencies.
Instructions may complete out of order after they can no longer fault, which is particularly advantageous with
multi-box instructions which may take many clock cycles to complete and would otherwise significantly stall the flow
of instructions. Dependencies may be reduced by switching instructions between the pipelines. Two execution
units in separate pipelines can be separately controlled to process a single instruction, using two separate
microinstruction flows, without significantly increasing complexity of the execution stages or the
microsequencer. Instructions can proceed independently through respective execution pipelines, allowing bubbles
caused by dependencies to be removed during the processing of instructions - in particular, read-after-write
dependencies may be eliminated without recompiling the source code to alter the order of instructions. All of
these features and advantages serve to maximize pipeline execution performance.

Other areas where embodiments of the present invention may be implemented to provide technical
advantages are as follows. By maintaining the status of pending writes to each of the physical registers, the control
for allocating registers and providing status information is also simplified. Checkpointing state information
relating to the physical registers allows the microprocessor to restore the processor state after instructions that
cause exceptions, mispredicted branches, floating point errors, or other instruction errors, within a single clock
cycle, thereby greatly reducing the penalty on recovery from such an error. Register renaming is supported
for multiple-sized logical registers to provide elimination of certain data dependencies while maintaining
compatibility with existing instruction sets using multiple-sized registers. A physical register can be quickly
identified responsive to a request for a logical register, using a minimum of hardware. Maintaining status information
associated with the physical registers allows data dependencies to be readily detected. The register translation
unit used in implementing these features and advantages can be, for selected instructions, controlled directly
by a microcontroller controlling the execution unit(s) that process such instructions (rather than by the normal
method of using hardware control signals).


Brief Description of the Drawings

For a more complete understanding of the present invention, and the advantages thereof, reference is now
made to the following descriptions taken in conjunction with the accompanying drawings, in which:
Figure 1a illustrates a block diagram of a superscalar, superpipelined microprocessor;
Figure 1b illustrates the seven pipeline stages of the microprocessor, including X and Y execution pipes;
Figure 2 is a block diagram of an exemplary computer system;
Figure 3 illustrates a timing diagram showing the flow of instructions through the pipeline unit;
Figure 4 illustrates a block diagram of the control mechanism for controlling the flow of instructions through
the pipeline unit;
Figure 5 illustrates a flow diagram illustrating out-of-order completion of instructions;
Figures 6a-b and 7 illustrate a flow of instructions through the pipeline using pipe switching;
Figure 8 illustrates a flow diagram describing the method of pipe switching;
Figure 9 illustrates a functional block diagram of the register translation unit;
Figure 10 illustrates the control registers used in the register translation unit;
Figure 11 illustrates circuitry for generating bits for the Register Busy register;
Figure 12a illustrates a representation of a variable size extended register under the X86 architecture;
Figure 12b illustrates a flow chart for allocating logical registers with variable sizes;
Figure 13 illustrates circuitry for selectable control of the register translation unit;
Figures 14a-b illustrate the portions of the register translation unit for performing translation and hazard
detection;
Figures 15a-b illustrate operand forwarding;
Figures 16a-b illustrate result forwarding;
Figures 17a-b illustrate the detection of forwarding situations;
Figure 18 is a block diagram of the forwarding circuitry; and
Figures 19a-b illustrate pipe control for multi-box instructions.


Detailed Description of the Preferred Embodiments

The detailed description of an exemplary embodiment of the microprocessor of the present invention is
organized as follows:

1. Exemplary Processor System
1.1. Microprocessor
1.2. System
2. Generalized Pipeline Flow
3. Pipeline Control
3.1. Generalized Stall Control
3.2. Pipe Switching
3.3. Multi-box Instructions
3.4. Exclusive Instructions
4. In-Order Passing/Out-of-Order Completion of Instructions
5. Pipe Switching
6. Issuing Instructions Without Regard to Dependencies
7. Multi-threaded EX Operation
8. Register Translation Unit
8.1. Register Translation Overview
8.2. Translation Control Registers
8.3. Register Allocation
8.4. Instructions With Two Destinations
8.5. Checkpointing Registers for Speculative Branch Execution
8.6. Recovery from Exceptions
8.7. Microcontrol of the Register Translation Unit
8.8. Register ID Translation and Hazard Detection
9. Forwarding
10. Conclusion

This organizational table and the corresponding headings used in this detailed description are provided
for the convenience of reference only. Detailed description of conventional or known aspects of the
microprocessor are omitted so as not to obscure the description of the invention with unnecessary detail.

1. Exemplary Processor System

The exemplary processor system is shown in Figures 1a and 1b, and Figure 2. Figures 1a and 1b
respectively illustrate the basic functional blocks of the exemplary superscalar, superpipelined microprocessor along
with the pipe stages of the two execution pipelines. Figure 2 illustrates an exemplary processor system
(motherboard) design using the microprocessor.

1.1. Microprocessor 

Referring to Figure 1a, the major sub-blocks of a microprocessor 10 include: (a) CPU core 20, (b) prefetch
buffer 30, (c) prefetcher 35, (d) BPU (branch processing unit) 40, (e) ATU (address translation unit) 50, and
(f) unified 16 Kbyte code/data cache 60, including TAG RAM 62. A 256 byte instruction line cache 65 provides
a primary instruction cache to reduce instruction fetches to the unified cache, which operates as a secondary
instruction cache. An onboard floating point unit (FPU) 70 executes floating point instructions issued to it by
the CPU core 20.

The microprocessor uses internal 32-bit address and 64-bit data buses ADS and DBS. A 256 bit (32 byte)
prefetch bus PFB, corresponding to the 32 byte line size of the unified cache 60 and the instruction line cache
65, allows a full line of 32 instruction bytes to be transferred to the instruction line cache in a single clock.
Interface to external 32 bit address and 64 bit data buses is through a bus interface unit BIU.

The CPU core 20 is a superscalar design with two execution pipes X and Y. It includes an instruction
decoder 21, address calculation units 22X and 22Y, execution units 23X and 23Y, and a register file 24 with 32
32-bit registers. An AC control unit 25 includes a register translation unit 25a with a register scoreboard and
register renaming hardware. A microcontrol unit 26, including a microsequencer and microROM, provides
execution control.

Writes from CPU core 20 are queued into twelve 32 bit write buffers 27 - write buffer allocation is
performed by the AC control unit 25. These write buffers provide an interface for writes to the unified cache -
non-cacheable writes go directly from the write buffers to external memory. The write buffer logic supports
optional read sourcing and write gathering.

A pipe control unit 28 controls instruction flow through the execution pipes, including keeping the
instructions in order until it is determined that an instruction will not cause an exception, squashing bubbles in the
instruction stream, and flushing the execution pipes behind branches that are mispredicted and instructions
that cause exceptions. For each stage, the pipe control unit keeps track of which execution pipe contains
the earliest instruction, and provides a stall output and receives a delay input.

BPU 40 predicts the direction of branches (taken or not taken), and provides target addresses for predicted
taken branches and unconditional change of flow instructions (jumps, calls, returns). In addition, it monitors
speculative execution in the case of branches and floating point instructions, i.e., the execution of instructions
speculatively issued after branches which may turn out to be mispredicted, and floating point instructions
issued to the FPU which may fault after the speculatively issued instructions have completed execution. If a
floating point instruction faults, or if a branch is mispredicted (which will not be known until the EX or WB stage
for the branch), then the execution pipeline must be repaired to the point of the faulting or mispredicted
instruction (i.e., the execution pipeline is flushed behind that instruction), and instruction fetch restarted.

Pipeline repair is accomplished by creating checkpoints of the processor state at each pipe stage as a
floating point or predicted branch instruction enters that stage. For these checkpointed instructions, all resources
(programmer visible registers, instruction pointer, condition code register) that can be modified by succeeding
speculatively issued instructions are checkpointed. If a checkpointed floating point instruction faults or a
checkpointed branch is mispredicted, the execution pipeline is flushed behind the checkpointed instruction -
for floating point instructions, this will typically mean flushing the entire execution pipeline, while for a
mispredicted branch there may be a paired instruction in EX and two instructions in WB that would be allowed to
complete.

For the exemplary microprocessor 10, the principal constraints on the degree of speculation are: (a)
speculative execution is allowed for only up to four floating point or branch instructions at a time (i.e., the speculation
level is maximum 4), and (b) a write or floating point store will not complete to the cache or external memory
until the associated branch or floating point instruction has been resolved (i.e., the prediction is correct, or
floating point instruction does not fault).

The unified cache 60 is 4-way set associative (with a 4k set size), using a pseudo-LRU replacement
algorithm, with write-through and write-back modes. It is dual ported (through banking) to permit two memory
accesses (data read, instruction fetch, or data write) per clock. The instruction line cache is a fully associative,
lookaside implementation (relative to the unified cache), using an LRU replacement algorithm.
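
The cache geometry implied by these figures can be worked out in a short sketch. It assumes that "4k set size" means each of the four ways holds 4 Kbytes (16 Kbytes / 4), and it uses the 32 byte line size given earlier; the constant and function names are illustrative only.

```python
CACHE_BYTES = 16 * 1024   # unified cache size from the description
WAYS = 4                  # 4-way set associative
LINE_BYTES = 32           # line size from the description

WAY_BYTES = CACHE_BYTES // WAYS          # assumed meaning of "4k set size"
NUM_SETS = WAY_BYTES // LINE_BYTES       # lines per way
OFFSET_BITS = LINE_BYTES.bit_length() - 1
INDEX_BITS = NUM_SETS.bit_length() - 1


def split_address(addr: int):
    """Split a 32-bit address into (tag, set index, byte offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


if __name__ == "__main__":
    print(WAY_BYTES, NUM_SETS, OFFSET_BITS, INDEX_BITS)
    print([hex(v) for v in split_address(0x12345678)])
```

Under this reading the index and offset together occupy the low 12 address bits, which fall inside a 4 Kbyte page offset; that is consistent with selecting a set before address translation completes.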

The FPU 70 includes a load/store stage with 4-deep load and store queues, a conversion stage (32-bit to
80-bit extended format), and an execution stage. Loads are controlled by the CPU core 20, and cacheable
stores are directed through the write buffers 29 (i.e., a write buffer is allocated for each floating point store
operation).

Referring to Figure 1b, the microprocessor has seven-stage X and Y execution pipelines: instruction
fetch IF, two instruction decode stages ID1 and ID2, two address calculation stages AC1 and AC2, execution
EX, and write-back WB. Note that the complex instruction decode ID and address calculation AC pipe stages
are superpipelined.

The IF stage provides a continuous code stream into the CPU core 20. The prefetcher 35 fetches 16 bytes
of instruction data into the prefetch buffer 30 from either the (primary) instruction line cache 65 or the
(secondary) unified cache 60. BPU 40 is accessed with the prefetch address, and supplies target addresses to
the prefetcher for predicted changes of flow, allowing the prefetcher to shift to a new code stream in one clock.

The decode stages ID1 and ID2 decode the variable length X86 instruction set. The instruction decoder
21 retrieves 16 bytes of instruction data from the prefetch buffer 30 each clock. In ID1, the length of two
instructions is decoded (one each for the X and Y execution pipes) to obtain the X and Y instruction pointers -
a corresponding X and Y bytes-used signal is sent back to the prefetch buffer (which then increments for the
next 16 byte transfer). Also in ID1, certain instruction types are determined, such as changes of flow, and
immediate and/or displacement operands are separated. The ID2 stage completes decoding the X and Y
instructions, generating entry points for the microROM and decoding addressing modes and register fields.

During the ID stages, the optimum pipe for executing an instruction is determined, and the instruction is
issued into that pipe. Pipe switching allows instructions to be switched from ID2x to AC1y, and from ID2y to
AC1x. For the exemplary embodiment, certain instructions are issued only into the X pipeline: change of flow
instructions, floating point instructions, and exclusive instructions. Exclusive instructions include: any
instruction that may fault in the EX pipe stage and certain types of instructions such as protected mode segment loads,
string instructions, special register access (control, debug, test), Multiply/Divide, Input/Output, PUSHA/POPA
(PUSH all/POP all), and task switch. Exclusive instructions are able to use the resources of both pipes because
they are issued alone from the ID stage (i.e., they are not paired with any other instruction). Except for these
issue constraints, any instructions can be paired and issued into either the X or Y pipe.
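
The issue constraints in this paragraph can be summarized as a small decision sketch. The instruction "kinds" and function names are hypothetical labels introduced for illustration, and the rule that two X-only instructions cannot be paired is an inference from the text (both would need the X pipe), not a statement from it. The real pipe choice also weighs dependencies, which this sketch ignores.

```python
from typing import Optional, Tuple

# Kinds that, per the description, may only issue into the X pipe.
X_ONLY_KINDS = {"change_of_flow", "floating_point", "exclusive"}


def can_pair(kind_a: str, kind_b: str) -> bool:
    """Exclusive instructions issue alone; at most one X-only kind per pair."""
    if "exclusive" in (kind_a, kind_b):
        return False
    return not (kind_a in X_ONLY_KINDS and kind_b in X_ONLY_KINDS)


def assign_pipes(kind_a: str, kind_b: Optional[str] = None) -> Tuple[Optional[str], Optional[str]]:
    """Return (kind issued to X, kind issued to Y) for this issue slot.

    If the two cannot be paired, only the first instruction issues
    (placed in X here for simplicity)."""
    if kind_b is None or not can_pair(kind_a, kind_b):
        return (kind_a, None)
    if kind_b in X_ONLY_KINDS:
        return (kind_b, kind_a)  # switch pipes so the X-only one lands in X
    return (kind_a, kind_b)


if __name__ == "__main__":
    print(assign_pipes("simple", "change_of_flow"))
    print(assign_pipes("exclusive", "simple"))
```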

The address calculation stages AC1 and AC2 calculate addresses for memory references and supply
memory operands. The AC1 stage calculates two 32 bit linear (three operand) addresses per clock (four operand
addresses, which are relatively infrequent, take two clocks). During this pipe stage, data dependencies are
also checked and resolved using the register translation unit 25a (register scoreboard and register renaming
hardware) - the 32 physical registers 24 are used to map the 8 general purpose programmer visible logical
registers defined in the X86 architecture (EAX, EBX, ECX, EDX, EDI, ESI, EBP, ESP). During the AC2 stage
the register file 26 and the unified cache 60 are accessed with the physical address (for cache hits, cache
access time for the dual ported unified cache is the same as that of a register, effectively extending the register
set) - the physical address is either the linear address, or if address translation is enabled, a translated address
generated by the ATU 50.

The AC unit includes 8 architectural (logical) registers (representing the X86 defined register set) that are
used by the AC unit to avoid the delay required to access in AC1 the register translation unit before accessing
register operands for address calculation. For instructions that require address calculations, AC1 waits until
the required data in the architectural registers is valid (no read after write dependencies) before accessing
those registers. During the AC2 stage, source operands are obtained by accessing the register file 26 and the
unified cache 60 with the physical address (for cache hits, cache access time for the dual ported unified cache
is the same as that of a register, effectively extending the register set) - the physical address is either the
linear address, or if address translation is enabled, a translated address generated by the ATU 50.

Translated addresses are generated by the ATU (using a TLB or translation lookaside buffer) from the
linear address using information from page tables in memory and workspace control registers on chip. The
unified cache is virtually indexed and physically tagged to permit, when address translation is enabled, set
selection with the untranslated address (available at the end of AC1) and, for each set, tag comparison with the
translated address from the ATU (available early in AC2). Checks for any segmentation and/or address
translation violations are also performed in AC2.

Instructions are kept in program order until it is determined that they will not cause an exception. For most
instructions, this determination is made during or before AC2 - floating point instructions and certain exclusive
instructions may cause exceptions during execution. Instructions are passed in order from AC2 to EX (or in
the case of floating point instructions, to the FPU) - because integer instructions that may still cause an
exception in EX are designated exclusive, and therefore are issued alone into both execution pipes, handling
exceptions in order is ensured.

The execution stages EXx and EXy perform the operations defined by the instruction. Instructions spend
a variable number of clocks in EX, i.e., they are allowed to execute out of order (out of order completion). Both
EX stages include adder, logical, and shifter functional units, and in addition, the EXx stage contains
multiply/divide hardware.

The write back stage WB updates the register file 24, condition codes, and other parts of the machine state
with the results of the previously executed instruction. The register file is written in PH1 (phase 1) of WB, and
read in PH2 (phase 2) of AC2.

Additional disclosure on the write buffer 27, speculative execution and the microsequencer may be found
in U.S. Ser. No. 08/138,654, (Atty Docket No. CX00186), entitled "Control of Data for Speculative Execution
and Exception Handling in a Microprocessor with Write Buffer" by Garibay et al, filed concurrently herewith,
U.S. Ser. No. 08/138,783, (Atty Docket No. CX00180), entitled "Branch Processing Unit" to McMahon, filed
concurrently herewith, U.S. Ser. No. 08/138,781, (Atty Docket No. CX00181), entitled "Speculative Execution
in a Pipelined Processor" to Bluhm, filed concurrently herewith, and U.S. Ser. No. 08/138,855, (Atty Docket
No. CX00167) to Hervin et al, entitled "Microprocessor Having Single Clock Instruction Decode Architecture",
filed concurrently herewith, all of which are incorporated by reference herein.
1.2. System

Referring to Figure 2, for the exemplary embodiment, microprocessor 10 is used in a processor system
that includes a single chip memory and bus controller 82. The memory/bus controller 82 provides the interface
between the microprocessor and the external memory subsystem - level two cache 84 and main memory 86
- controlling data movement over the 64 bit processor data bus PD (the data path is external to the controller
which reduces its pin count and cost).

Controller 82 interfaces directly to the 32-bit address bus PADDR, and includes a one bit wide data port
(not shown) for reading and writing registers within the controller. A bidirectional isolation buffer 88 provides
an address interface between microprocessor 10 and VL and ISA buses.

Controller 82 provides control for the VL and ISA bus interface. A VL/ISA interface chip 91 (such as an
HT321) provides standard interfaces to a 32 bit VL bus and a 16 bit ISA bus. The ISA bus interfaces to BIOS
92, keyboard controller 93, and I/O chip 94, as well as standard ISA slots 95. The interface chip 91 interfaces
to the 32 bit VL bus through a bidirectional 32/16 multiplexer 96 formed by dual high/low word [31:16]/[15:0]
isolation buffers. The VL bus interfaces to standard VL slots 97, and through a bidirectional isolation buffer
98 to the low double word [31:0] of the 64 bit processor data bus PD.

2. Generalized Pipeline Flow

Figure 3 illustrates the flow of eight instructions through the pipeline, showing the overlapping execution
of the instructions, for a two pipeline architecture. Additional pipelines and additional stages for each pipeline
could also be provided. In the preferred embodiment, the microprocessor 10 uses an internal clock 122 which
is a multiple of the system clock 124. In Figure 3, the internal clock is shown as operating at two times the
frequency of the system clock.

During the first internal clock cycle 126, the ID1 stage operates on respective instructions X0 and Y0.
During internal clock cycle 128, instructions X0 and Y0 are in the ID2 stage (X0 being in ID2x and Y0 being in
ID2y) and instructions X1 and Y1 are in the ID1 stage. During internal clock cycle 130, instructions X2 and Y2
are in the ID1 stage, instructions X1 and Y1 are in the ID2 stage (X1 being in ID2x and Y1 being in ID2y) and
instructions X0 and Y0 are in the AC1 stage (X0 being in AC1x and Y0 being in AC1y). During internal clock
cycle 132, instructions X3 and Y3 are in the ID1 stage, instructions X2 and Y2 are in the ID2 stage, instructions
X1 and Y1 are in the AC1 stage and instructions X0 and Y0 are in the AC2 stage. The instructions continue
to flow sequentially through the stages of the X and Y pipelines.
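
The stall-free flow just described can be sketched in a few lines of Python. This is only an illustration of the optimum case; the function name `ideal_flow` and the representation of an instruction pair by its index are assumptions made for the sketch, not part of the disclosed hardware.

```python
# Stage names are taken from the text; everything else is a hypothetical illustration.
STAGES = ["ID1", "ID2", "AC1", "AC2", "EX", "WB"]

def ideal_flow(num_pairs, num_cycles):
    """Return {cycle: {stage: pair_index}} for a flow in which no stage ever delays.

    Pair k (instructions Xk and Yk) is assumed to enter ID1 on cycle k and to
    advance exactly one stage per internal clock cycle.
    """
    table = {}
    for cycle in range(num_cycles):
        occupancy = {}
        for depth, stage in enumerate(STAGES):
            pair = cycle - depth
            if 0 <= pair < num_pairs:
                occupancy[stage] = pair
        table[cycle] = occupancy
    return table

flow = ideal_flow(num_pairs=4, num_cycles=4)
for cycle, occupancy in flow.items():
    print(cycle, {stage: f"X{p}/Y{p}" for stage, p in occupancy.items()})
```

In the fourth cycle of this sketch, pair 3 is in ID1, pair 2 in ID2, pair 1 in AC1 and pair 0 in AC2, matching the description of internal clock cycle 132.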

As shown in clocks 134-140, the execution portion of each instruction is performed on sequential clock
cycles. This is a major advantage of a pipelined architecture - the number of instructions completed per clock
is increased, without reducing the execution time of an individual instruction. Consequently, a greater
instruction throughput is achieved without requiring greater demands on the speed of the hardware.

The instruction flow shown in Figure 3 is the optimum case. As shown, no stage requires more than one
clock cycle. In an actual machine, however, one or more stages may require additional clock cycles to complete,
thereby changing the flow of instructions through the other pipe stages. Furthermore, the flow of instructions
through one pipeline may be dependent upon the flow of instructions through the other pipeline.

A number of factors may cause delays in various stages of one or all of the pipelines. For example, an
access to memory may miss in the memory cache, thereby preventing access of the data in the time required
to process the instruction in one clock. This would require either, or both, sides of the EX stage to delay
until the data was retrieved from main memory. An instruction may require a hardware resource, such as a
multiplier, which is only in one of the execution stages (in EXx in the X execution pipe) in the illustrated
embodiment. In this case, the instruction must delay until the resource is available. Data dependencies can also
cause delays. If an instruction needs the result from a previous instruction, such as an ADD, it must wait until
that instruction is processed by the execution unit.

Other delays are caused by "multi-box" instructions, i.e., instructions which are implemented using
multiple microinstructions, and therefore require more than one clock cycle in the EX pipe stage to complete. These
instructions stop the flow of subsequent instructions through the pipeline at the output of the ID2 stage.

The flow of instructions through the pipeline is controlled by the pipe control unit 28. In the preferred
embodiment, a single pipe control unit 28 is used to control the flow of instructions through both (or all) of the
pipes. To control the flow of instructions through the pipes, the pipe control unit 28 receives "delay" signals
from the various units comprising the pipelines 102 and 104, and issues "stall" signals to the various units.

Although a single pipe control unit 28 is used for both X and Y pipelines, the pipelines themselves are
controlled independently of one another. In other words, a stall in the X pipeline does not necessarily cause a stall
in the Y pipeline.

3. Pipeline Control 

Figure 4 illustrates the inter-stage communication between pipeline stages. The stages are arbitrarily
designated as stage N-1, stage N, and stage N+1. Each stage has a unique input STALL from the pipe control
unit (pipe controller) 28 and an output DELAY. The DELAY output is enabled if the stage needs at least one
more clock to complete the instruction it contains. For each pipeline, the pipe control unit 28 determines
whether a stage of a pipe is "done" based on the DELAY signal. A stage is "done" if it is ready to pass its instruction
to a succeeding stage. The STALL input to a stage is enabled by the pipe control unit 28 if the stage cannot
transfer an instruction to the succeeding pipe stage, because that succeeding stage is delayed or stalled. In
the preferred embodiment, a pipeline stage is stalled only if it is not delayed (i.e., the DELAY signal is false).

A "valid" pipe stage is one containing an instruction, either in progress or complete. An invalid pipe stage
does not contain an instruction. An invalid pipe stage is said to contain a "bubble". "Bubbles" are created at
the front end of the pipeline 100 when the ID1 and ID2 stages cannot decode enough instructions to fill empty
AC1 and AC2 stages 112 and 114. Bubbles can also be created when a pipe stage transfers its instruction to
the succeeding stage, and the prior stage is delayed. While the pipe stages do not input or output bits indicating
the validity of the stages, bubbles in the stages are tracked by the pipe control unit 28.

In some cases, a bubble in a pipe stage may be overwritten by an instruction from the preceding stage,
referred to as a "slip". Pipe stages may also be "flushed" if they contain an instruction which should not
complete due to an exception condition in a succeeding pipe stage. The signal FLUSH is an input to each pipe
stage. A pipe stage generates an "exception" if its instruction cannot complete due to an error condition and
should not transfer beyond the current stage. Exceptions can occur in the IF stage 106, the ID1 and ID2 stages,
and the AC1 and AC2 stages for all instructions. Certain instructions, designated as "exclusive" instructions,
may have exceptions occur in the execution stage 116. Furthermore, exceptions can occur for floating point
instructions.

3.1. Generalized Stall Control 

In the general case, the pipe controller will stall a stage of a pipeline if the stage is valid and it is not delayed
and the next stage is either delayed or stalled. This may be described logically for a stage N as:

STALL_N = v_N · !d_N · (d_(N+1) + STALL_(N+1))

where:

v_N is true if stage N is valid,
d_N is true if DELAY for stage N is true, and
! denotes that the succeeding term is negated.

For a six stage pipeline, the description can be expanded as:

STALL6 = false
STALL5 = v5·!d5·d6
STALL4 = v4·!d4·(d5 + v5·!d5·d6)
STALL3 = v3·!d3·(d4 + v4·!d4·(d5 + v5·!d5·d6))
STALL2 = v2·!d2·(d3 + v3·!d3·(d4 + v4·!d4·(d5 + v5·!d5·d6)))
STALL1 = v1·!d1·(d2 + v2·!d2·(d3 + v3·!d3·(d4 + v4·!d4·(d5 + v5·!d5·d6))))

When the pipe control unit 28 stalls a stage of a pipeline, it does not necessarily stall the corresponding
stage of the other pipeline. Whether the other stage is stalled depends upon the sequence of instructions and
other factors, as described below.
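
The generalized stall rule for a single pipeline can be evaluated from the last stage backward. The following Python sketch assumes the per-stage valid and DELAY bits are available as lists of booleans; the function name `compute_stalls` and the list layout are hypothetical and chosen only for illustration.

```python
def compute_stalls(valid, delay):
    """Evaluate STALL for every stage of one pipeline.

    valid[i] and delay[i] describe stage i+1 (index 0 is the first stage).
    A stage is stalled if it is valid, not delayed, and the next stage is
    either delayed or stalled.
    """
    n = len(valid)
    stall = [False] * n  # the last stage has no successor, so it is never stalled
    for i in range(n - 2, -1, -1):
        stall[i] = valid[i] and not delay[i] and (delay[i + 1] or stall[i + 1])
    return stall

# Stage 6 is delayed and every stage holds an instruction: stages 1-5 stall.
print(compute_stalls([True] * 6, [False] * 5 + [True]))

# Same delay, but stage 3 holds a bubble: the bubble absorbs the stall,
# so stages 1 and 2 are free to advance.
print(compute_stalls([True, True, False, True, True, True], [False] * 5 + [True]))
```

The second call shows why the valid term appears in the rule: an invalid stage is never stalled, so instructions behind a bubble may continue to move.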

3.2. Pipe Switching

While the general model above works for an architecture where instructions flow through the pipe they
enter, a more complex control structure is necessary when instructions are allowed to switch between pipes
as shown in Figure 2. The mechanism for determining whether a switch will occur is described in greater detail
hereinbelow.

In the preferred embodiment, the pipe control unit 28 keeps the instructions "in program order" (or "in
order") until they are passed from their AC2 stage to the EX stage. "In order" means that a "junior" instruction
cannot be at a pipeline stage beyond a "senior" instruction (a junior instruction's position in the sequence of
instructions received by the microprocessor is after that of a senior instruction), although a junior instruction
may be at the same stage as a senior instruction. Thus, instruction I_(T+1) (the junior instruction) can be in AC1x
while instruction I_T (the senior instruction) is in AC1y, but I_(T+1) cannot advance to AC2x until I_T advances to
AC2y, although I_T can advance without waiting for I_(T+1) to advance.

Due to the sequential nature of the IF stage and the ID1 stage, instructions will not get out of order within
these two stages. The flow of instructions through the ID2, AC1 and AC2 stages, however, necessitates
modifications to the general stall mechanism. To aid in controlling instruction flow in this situation, the pipe control
unit 28 maintains a control signal XFIRST for each pipe stage. If XFIRST is true for a particular stage, then
the instruction in this stage of the X pipeline is senior to the instruction in the corresponding stage of the Y
pipeline. In the illustrated embodiment, with two pipelines, XFIRST indicates which pipeline has the senior of
the two instructions for a particular stage; for implementations with more than two pipes, XFIRST would indicate
the relative seniority of each instruction at each stage.

At the output of the ID2 units, the pipe control unit must determine whether an instruction can proceed to
either AC1x or AC1y. A senior instruction can proceed (assuming it is valid and not delayed) if the
succeeding stage of either pipeline is not delayed or stalled. A junior instruction can proceed (assuming it is valid
and not delayed) only if the senior instruction in the corresponding stage of the other pipeline will not delay or
stall. This can be described logically as:

st3X = v3X·(d3X + d4X + STALL4X)

st3Y = v3Y·(d3Y + d4Y + STALL4Y)

where st3 specifies whether the corresponding pipeline will stall or delay at or below the ID2 stage.

STALL3X = v3X·!d3X·(d4X + STALL4X) + !XFIRST3·st3Y
STALL3Y = v3Y·!d3Y·(d4Y + STALL4Y) + XFIRST3·st3X
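
As a rough check of the ID2-stage stall equations above, they can be transcribed directly into Python. The function name `id2_stalls` and its argument names are assumptions made for this sketch; stage 3 is ID2 and stage 4 is AC1, as in the equations.

```python
def id2_stalls(v3x, d3x, d4x, stall4x, v3y, d3y, d4y, stall4y, xfirst3):
    """Return (STALL3X, STALL3Y) for the ID2 stage of the X and Y pipelines."""
    # st3: the pipe will stall or delay at or below the ID2 stage.
    st3x = v3x and (d3x or d4x or stall4x)
    st3y = v3y and (d3y or d4y or stall4y)
    # Each side stalls for its own reasons, or because it is junior
    # and the senior instruction in the other pipe cannot move.
    stall3x = (v3x and not d3x and (d4x or stall4x)) or (not xfirst3 and st3y)
    stall3y = (v3y and not d3y and (d4y or stall4y)) or (xfirst3 and st3x)
    return stall3x, stall3y

# AC1x is delayed and the X instruction is senior: the junior Y instruction is held too.
print(id2_stalls(True, False, True, False, True, False, False, False, xfirst3=True))

# Same delay, but the Y instruction is senior: Y is free to advance.
print(id2_stalls(True, False, True, False, True, False, False, False, xfirst3=False))
```

The two calls differ only in XFIRST, which shows how seniority alone decides whether a stall on one side propagates to the other.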

3.3. Multi-box Instructions 

The EX stage of each pipeline is controlled independently of the other EX stage by microinstructions from
the microROM. While many instructions are implemented by a single microinstruction, and hence pass through
the EX stage in a single clock cycle, some instructions require multiple microinstructions for their execution
and hence require more than one clock cycle to complete. These instructions are referred to as "multi-box"
instructions.

Because the microROM cannot be accessed by another instruction in the same pipeline during execution
of a multi-box instruction, a new instruction cannot be passed from the ID2 stage of a pipe to an AC1 stage
of a pipe until after the last microROM access for the multi-box instruction. This is due to the microROM being
accessed during AC1. As the multi-box instruction reads its last microinstruction, the following instruction is
allowed to access the microROM and enter AC1, so that no bubbles are produced.

When the ID2 stage of a pipeline receives an instruction from the ID1 stage, it decodes whether the
instruction is multi-box. The pipe control unit 28 will stall the ID2 stage until the multi-box instruction is finished
with the microROM. The EX stage will signal the end of a multi-box instruction via a UDONE signal. The control
necessary to support multi-box instructions may be described as:


st3X = !d3X·v3X·(d4X + STALL4X + MULTIBOX4X·!UDONE4X)
st3Y = !d3Y·v3Y·(d4Y + STALL4Y + MULTIBOX4Y·!UDONE4Y)
STALL3X = st3X + !XFIRST3·st3Y
STALL3Y = st3Y + XFIRST3·st3X

A multi-box instruction can use the resources of AC1, AC2, and EX. Additional pipe control relating to
multi-box instructions is discussed in connection with Figures 19a-b. In Figure 19a, I0 is in the EX stage of the X
pipeline and a multi-box instruction, I1, is in the AC2 (I1a) and AC1 (I1b) stages. From the viewpoint of the pipe
control unit, the multi-box instruction I1 is treated as a single instruction, and a delay in any stage occupied
by the multi-box instruction will cause all stages associated with the multi-box instruction to stall. Thus, a delay
in I1b will cause I1a to stall, even though I1a is in front of I1b in the pipeline. This is the only situation in which a
delay in one stage will result in a stall in a succeeding stage.

The pipe control unit 28 keeps track of the boundaries between instructions through the use of a head bit
associated with each microinstruction. The head bit indicates whether the microinstruction is the first
microinstruction of an instruction, even if the instruction is a one-box instruction. If the head bit is not true for a
given microinstruction, then it is not the first microinstruction. By checking the head bit for each
microinstruction in a pipeline, the pipe control unit can determine boundaries between instructions and stall the stages
accordingly.
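
The use of the head bit to recover instruction boundaries can be illustrated with a short Python sketch. The tuple representation of a microinstruction and the function name `group_by_head_bit` are assumptions for illustration only; the labels I0, I1a, I1b and I2 are placeholder names.

```python
def group_by_head_bit(microinstructions):
    """Group microinstructions into instructions using the head bit.

    microinstructions: list of (label, head_bit) tuples in program order.
    A true head bit marks the first microinstruction of an instruction.
    """
    groups = []
    for label, head in microinstructions:
        if head or not groups:
            groups.append([label])      # start of a new instruction
        else:
            groups[-1].append(label)    # continuation of a multi-box instruction
    return groups

stream = [("I0", True), ("I1a", True), ("I1b", False), ("I2", True)]
print(group_by_head_bit(stream))
```

Stages holding microinstructions from the same group would be stalled together, which is how a multi-box instruction is kept intact.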

3.4. Exclusive Instructions 

Another type of instruction used in the preferred embodiment is an "exclusive" instruction. Any instruction
which has the possibility of causing an exception while executing in the EX stage is labeled as exclusive.
Exceptions are discussed in greater detail below. An instruction requiring multiple memory accesses is labeled
exclusive because it may cause an exception during such an access. Other instructions are labeled as
exclusive because they modify control registers or memory management registers, or use a resource, such as multiply
hardware, only available in one execution pipe. Exclusive instructions may be either single-box or multi-box.
Exclusive instructions must be executed alone (i.e., no other instruction is used in the corresponding stage of
the other pipe), due to the exclusive instruction's effect on the state of the machine or because the instruction
can benefit from the use of both EX units. Examples of exclusive instructions that may cause an exception in
the EX stage are DIV (divide), which may cause a divide by zero error, and instructions which must perform
memory access during the EX stage, like PUSHA. Other examples of exclusive instructions from the 486
instruction set are: ARPL, BOUND, CALL, CLC, CLD, CLI, CLTS, CMC, CMPS, DIV, ENTER, HLT, IDIV, IMUL,
IN, INS, INT, INTO, INVD, INVLPG, IRET, LAHF, LAR, LEAVE, LGDT, LIDT, LGS (PM), LSS (PM), LDS (PM),
LES (PM), LFS (PM), LLDT, LMSW, LODS, LSL, LTR, MOV (SR), MOVS, MUL, OUT, OUTS, POPA, POPF, POP
MEM, PUSHA, PUSHF, PUSH MEM, RET, SAHF, SCAS, SGDT, SIDT, SLDT, SMSW, STC, STD, STI, STOS,
STR, VERR, VERW, WAIT, and WBINVD, where "PM" denotes a protected mode instruction and "SR" denotes
an instruction using special or control registers.

The ID1 stage decodes which instructions are exclusive. The pipe control unit 28 stalls exclusive
instructions at the ID2 stage until both AC1x and AC1y stages are available.

Figure 19b illustrates the effect of a delay of an exclusive multi-box instruction. 

An exclusive multi-box instruction occupies the EX, AC2 and AC1 stages for both the X and Y pipelines.
If any of the stages occupied by the exclusive multi-box instruction delays, the corresponding stage of the
opposite pipeline will also delay, and the other stages associated with the multi-box instruction will be stalled by
the pipe control unit in order to keep the multi-box instruction together. Hence, if the portion of the instruction
in one stage of one pipeline delays, the portion in the same stage of the other pipeline also delays, and the
portions in the remaining four stages are stalled. With an exclusive multi-box instruction, two head bits, one for each
pipeline, are used to denote the beginning of the instruction.

4. In-Order Passing/Out-of-Order Completion of Instructions

Referring to Figures 1a and 1b, as described above, instructions are maintained in order by the pipe control
unit 28 until they pass from the AC2 stage to the EX stage. An instruction is considered "passed" to the EX
stage once execution begins on the instruction, since some preliminary procedures relating to advancement
to the next stage, such as changing pointers to the instruction, may be done before all exceptions are reported.

Once an instruction passes from an AC2 stage to an EX stage, it can complete its execution out-of-order
(i.e., the junior instruction can continue on to the write back stage before the senior instruction), unless there
is a resource or a data dependency which prevents the instruction from executing out-of-order. For example,
a read-after-write (RAW) dependency would prevent an instruction from completing its EX stage until the
dependency is cleared. Thus, an instruction such as ADD AX,BX cannot complete its EX stage until execution
of a previous ADD BX,CX is completed, since the value of operand BX is dependent upon the previous
instruction.

However, junior instructions which pass to the EX stage without dependencies on a senior instruction may
complete, and it is therefore possible for many instructions to pass a senior instruction which requires multiple
clock periods in the opposite EX stage. This aspect of the preferred embodiment greatly increases instruction
throughput.
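
The condition that decides whether a junior instruction in the EX stage must wait on a senior one can be sketched as a simple register check. The `(destination, sources)` tuple encoding and the function name `has_raw` are assumptions made for this illustration.

```python
def has_raw(senior, junior):
    """Return True if the junior instruction reads a register the senior writes.

    Each instruction is modelled as (destination_register, set_of_source_registers).
    """
    senior_dest, _ = senior
    _, junior_sources = junior
    return senior_dest in junior_sources

add_bx_cx = ("BX", {"BX", "CX"})   # ADD BX,CX writes BX
add_ax_bx = ("AX", {"AX", "BX"})   # ADD AX,BX reads BX, so it must wait
add_dx_si = ("DX", {"DX", "SI"})   # ADD DX,SI is independent and may complete first

print(has_raw(add_bx_cx, add_ax_bx))
print(has_raw(add_bx_cx, add_dx_si))
```

Only the first pair has a read-after-write dependency; the independent junior instruction in the second pair is the kind that may pass a slow senior instruction.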

In the preferred embodiment, instructions are maintained in order until they cannot cause an exception.
An exception is caused by a program error and is reported prior to completion of the instruction that generated
the exception. By reporting the exception prior to instruction completion, the processor is left in a state which
allows the instruction to be restarted and the effects of the faulting instruction to be nullified. Exceptions
include, for example, divide-by-zero errors, invalid opcodes and page faults. Debug exceptions are also handled
as exceptions, except for data breakpoints and single-step operations. After execution of the exception service
routine, the instruction pointer points to the instruction that caused the exception and typically the instruction
is restarted.

Any instruction which is capable of causing an exception must be restartable. Accordingly, if an exception
occurs, the state of the machine must be restored to the state prior to starting the instruction. Thus, changes
to the state of the machine by the instruction causing the exception and subsequent instructions must be
undone. Typically, restarting the instruction involves resetting the state of the register file and restoring the stack
pointer, the instruction pointer and the flags. Because most exceptions occur at the AC2 stage, the exception
is asserted at the output of the AC2 stage (except for exclusive instructions that cause exceptions in the EX stage).
Instructions are restarted at the ID1 stage.

If the instruction causing the exception is junior to the instruction in the corresponding AC2 stage (the
neighboring instruction), then the neighboring instruction may continue to the EX stage. However, if the
instruction causing the exception is the senior instruction, both instructions must be restarted. In other words,
the state of the machine must be restored to the state which existed prior to any changes caused by the
instruction causing the exception, while instructions earlier in the program sequence are allowed to continue through
the pipelines.

In the preferred embodiment, if a non-exclusive multi-box instruction is executing in one pipeline, multiple
instructions may flow through the other pipeline during execution of the multi-box instruction. Because a
multi-box instruction may use the AC1, AC2 and EX stages, only the stage processing the microinstruction with the
head bit for the multi-box instruction is kept in order. Hence, the AC1 and AC2 stages will not prevent junior
instructions from advancing if the stages do not contain the microinstruction with the head bit. Two factors will control
whether instructions can continue to flow: (1) whether the multi-box instruction creates a data dependency
with a junior instruction or (2) whether the multi-box instruction causes a resource dependency with a junior
instruction.

Resource dependencies are created when the junior instruction needs a resource being used by the senior
instruction. For example, in the preferred embodiment only the X-pipe EX unit has a multiplier, in order to
reduce the area for the EX units. If a multi-box instruction is operating in the X-side EX unit, a subsequent
instruction needing the multiplier cannot be executed until after completion of the senior instruction.

Figure 5 is a flow chart illustrating the general operation of the pipe control unit 28 with regard to
the passing of instructions from the AC2 stage to the EX stage and the completion of the EX stage. The pipe
controller determines (200) whether an instruction can cause an exception at its present stage (or beyond). If
not, the instruction is allowed to complete (202) ahead of senior instructions (so long as those senior instructions
can no longer cause an exception). If the instruction may still cause an exception, then the pipe controller will
not allow the instruction to change the state of the microprocessor before all senior instructions have made
their changes to the state of the microprocessor at that stage (204). In other words, all state changes are made
in program order until the instruction can no longer cause an exception.
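
The decision of Figure 5 can be summarized as a small predicate. The sketch below is one possible reading of blocks 200-204; the function name `may_change_state` and its boolean inputs are hypothetical, and the handling of a senior instruction that can still cause an exception (falling back to the in-order rule) is an assumption drawn from the parenthetical in the text.

```python
def may_change_state(can_except, seniors_can_except, seniors_done):
    """Decide whether an instruction may change machine state now.

    can_except:          this instruction can still cause an exception (block 200)
    seniors_can_except:  one flag per senior instruction still in flight
    seniors_done:        one flag per senior instruction, true once it has made
                         its own state changes at this stage
    """
    if not can_except and not any(seniors_can_except):
        return True                 # block 202: free to complete out of order
    return all(seniors_done)        # block 204: state changes stay in program order

print(may_change_state(False, [False, False], [False, False]))  # may pass its seniors
print(may_change_state(True, [False], [False]))                 # must wait for the senior
```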

In the more specific case, discussed above, block 204 of the flow diagram is implemented by maintaining
the program order of instructions through the AC2 stage. For the majority of instructions in the X86 instruction
set, it can be determined whether an instruction will cause an exception by the AC2 stage. Exclusive
instructions, which are allowed to cause an exception in the EX stage, are executed alone in the EX stage so that
the state of the machine may be restored if an exception occurs.

While the above description provides that the instructions are kept in order through the point where they
can no longer cause an exception, an alternative, more general, method of pipe control would be to allow
instructions to proceed out of order, so long as the instruction did not alter the state of the processor.

5. Pipe Switching

Referring to Figures 1a and 1b, the pipe control unit 28 controls whether an instruction switches between
pipelines after the ID2 stage. Hence an instruction may progress through the pipelines from ID2x to either AC1x
or AC1y and from ID2y to either AC1x or AC1y under the control of the pipe control unit 28.

In the preferred embodiment, the pipe control unit 28 will decide in which pipe, X or Y, to place an instruction
based on certain criteria. The first criterion is whether one pipeline has a bubble which could be removed. If so,
the pipe control unit will try to move the most senior of the instructions in the ID2 stage into that pipeline. Thus, if AC1x
is valid and AC1y is invalid, and the instruction in ID2x is the senior of the two instructions in the ID2 stage,
then the pipe control unit 28 will transfer the instruction from ID2x to AC1y.

The second criterion is to prevent new bubbles in the pipeline from occurring. To prevent bubbles from
occurring, the pipe control unit 28 will attempt to keep dependent pairs of instructions, where the dependent
instruction may be delayed, from affecting other instructions. To accomplish this, in the preferred embodiment,
the pipe control unit 28 will keep adjacent instructions in program order from being on top of one another in a
pipeline.

Figure 6a illustrates this problem. At time T1, instruction I1 is in EXx, instruction I2 is in EXy, instruction
I3 is in AC2y and instruction I4 is in AC2x. I2 has a read-after-write dependency on I1; in other words, for
instruction I2 to be properly processed in the EXy stage, it must wait for the outcome of instruction I1 in the EXx
stage. For example, I1 could be an ADD AX,BX instruction and I2 could be an ADD AX,CX instruction. I2 cannot
complete because one of its operands will not be ready until after I1 completes. At time T2, I1 completes,
leaving a bubble in EXx. I2 is executing in EXy. I3 cannot proceed to the EX stage until I2 completes. I4 cannot
proceed to the EX stage because it is junior to I3 and, as stated above, instructions cannot proceed past a
senior instruction until entering the EX stage.

The consequence of keeping adjacent instructions in program order from being on top of one another
in a pipeline is shown in Figure 6b. In this example, the pipe control unit 28 has ordered the pairs in AC2 at
time T1 such that I3 is in AC2x and I4 is in AC2y. The reason for ordering the instructions in this manner is to
prevent I3 from being above I2 in the Y pipeline. Thus, at time T2, I1 has completed the EX stage and moves
to the writeback stage. I3 can now move into EXx, thus preventing the occurrence of a bubble in EXx. Similarly,
I5 can move into AC2x.

In some instances, the pipe control unit 28 must place adjacent instructions above one another in a pipeline.
Typically, this situation is caused by an X-only instruction, which must be placed in the X pipeline, or because
the pipe control unit 28 needed to remove a bubble, which necessitated a perturbation in the desired order.
Figure 7 illustrates such a situation. At time T1, I1 and I2 are in EXx and EXy, respectively. I3 and I4 are in
AC2x and AC2y, respectively. I5 and I6 are in AC1y and AC1x, respectively, because I6 is an X-only instruction
and therefore the pipe control unit 28 was forced to put I6 into AC1x, even though doing so forced I5 to be on
top of I4 in the Y pipeline. I7 and I8 are in ID2x and ID2y, respectively. I4 has a read-after-write dependency
on I3 and I6 has a read-after-write dependency on I5. At T2, I1 and I2 have moved to the WB stage and I3 and
I4 have moved into the EX stage. I6 has moved to AC2x and I5 has moved to AC2y; therefore, the pipe control
unit 28 has allowed I7 and I8 to switch pipelines in order to prevent I7 from being on top of I6 in the X pipeline.
I9 and I10 have moved into ID2.

At T3, I3 has completed in EXx and moved to the WB stage, and I4 remains in EXy to complete its operation. As
described in connection with Figure 6a, neither I5 nor I6 can proceed down either pipeline, and thus instructions
I5 and above remain in their respective stages. At T4, I4 completes and I5 and I6 move into EXy and EXx,
respectively. I7 and I8 move to AC2y and AC2x, respectively. I9 and I10 move to AC1y and AC1x, respectively,
to prevent adjacent instructions I9 and I8 from both being in the X pipeline. I11 and I12 move into the ID2 stage.

At T5, I5 completes and I7 moves into EXy. I6 stays in EXx because of the read-after-write dependency.
I9 moves to AC2y, I11 moves to AC1y and I13 moves to ID2x. As can be seen, the potential bubble created by
I6 remaining in EXx has been avoided due to proper sequencing of the instructions by the pipe control unit
28.

Although a specific ordering of instructions has been described in connection with Figures 6-7, other
methods of sequencing instructions may be used to promote the efficient flow of instructions through the pipeline.
Also, the point of switching need not be at the ID2 stage. As shown above, the pipe control unit 28 uses the
switching point to provide a sequence of instructions which reduces dependencies between instructions which
could cause bubbles to be created.

A flow chart illustrating the general operation of the pipe control unit with regard to pipe switching is shown
in Figure 8. The pipe controller determines (210) whether the instruction must be placed down a certain
pipeline, such as an X-only instruction. If so, the pipe control unit 28 will place (212) the instruction in that pipeline
as available. If the instruction can be placed in any pipe, the pipe control unit 28 will determine (214) whether
there is a bubble in either of the pipelines which could be filled. If so, the pipe control unit 28 will move (216)
the instruction into the stage with the bubble. If there are no bubbles (or if both pipelines are available), the
pipe control unit 28 will place instructions in the X or Y pipelines based on an evaluation of the best sequence
for avoiding dependencies (218 and 220). As described above, in one embodiment, the pipe controller avoids
dependencies by avoiding the placement of adjacent instructions above one another in the same pipeline.
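
The Figure 8 decision sequence can be approximated by the following Python sketch. It is a simplification under stated assumptions: the function name `choose_pipe`, the boolean availability flags for AC1x and AC1y, and the use of the immediately senior instruction's pipe (`prev_pipe`) as the only dependency-avoidance heuristic are all hypothetical.

```python
def choose_pipe(x_only, ac1x_free, ac1y_free, prev_pipe):
    """Pick the pipe ('X' or 'Y') for the next instruction leaving ID2, or None to wait.

    x_only:     the instruction must use the X pipe (blocks 210/212)
    ac1x_free:  AC1x can accept an instruction
    ac1y_free:  AC1y can accept an instruction
    prev_pipe:  pipe holding the immediately senior instruction
    """
    if x_only:
        return "X" if ac1x_free else None
    if ac1x_free != ac1y_free:
        # Blocks 214/216: exactly one side has room, so fill it.
        return "X" if ac1x_free else "Y"
    if not ac1x_free:
        return None                      # neither side can accept an instruction
    # Blocks 218/220: avoid stacking adjacent instructions in the same pipe.
    return "Y" if prev_pipe == "X" else "X"

print(choose_pipe(x_only=False, ac1x_free=True, ac1y_free=True, prev_pipe="X"))
```

With both pipes available and the senior neighbour in the X pipe, the sketch selects the Y pipe, which mirrors the ordering shown in Figure 6b.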

6. Issuing Instructions Without Regard to Dependencies

Instructions are issued from ID1 to ID2 without regard to dependencies which may exist between the two
instructions. An alternative approach is to determine whether a pair (or more) of instructions have a dependency, and if so,
issue the first instruction with a bubble in the corresponding stage in the other pipe such that the bubble
remains paired with the issued instruction through the pipeline. Under that alternative, the number of instructions that
are processed over a given time period will be reduced.

To improve performance, the microprocessor disclosed herein will issue instructions with dependencies
simultaneously into the pipelines. The dependency is checked at the point where the instruction needs to use
the data on which it is dependent. That is, the point at which the dependency will cause a stall in the pipeline
depends upon the nature of the dependency - if the dependent data is needed for an address calculation, the
stall will occur in AC1, while if the data is needed for execution, the stall will occur in EX. Until the time of the
stall, movement of the instructions in the pipe or other mechanisms may resolve the dependency, and thus
provide for a more efficient flow of instructions.
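
The idea that a dependency is checked only where the data is actually consumed can be sketched as follows. The function name `stall_stage` and the use of register-name sets are assumptions made for illustration.

```python
def stall_stage(addr_regs, exec_regs, pending_writes):
    """Return the stage where a dependent instruction would stall, or None.

    addr_regs:       registers the instruction needs for address calculation
    exec_regs:       registers the instruction needs as execution operands
    pending_writes:  registers with an outstanding write from a senior instruction
    """
    if addr_regs & pending_writes:
        return "AC1"    # data needed for the address calculation
    if exec_regs & pending_writes:
        return "EX"     # data needed only for execution
    return None

print(stall_stage({"BX"}, {"AX"}, {"BX"}))        # BX is used as an address register
print(stall_stage(set(), {"AX", "BX"}, {"BX"}))   # BX is used only as an operand
```

In the second case the instruction keeps moving until EX, which leaves several clocks in which the senior write may complete and remove the stall altogether.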

7. Multi-threaded EX Operation

Referring to Figures 1a and 1b, the microsequencer circuitry 23 provides independent flows of
microinstructions to the EX stages. Hence, control of the EXx stage is independent of control of the EXy stage.

By controlling the execution of both EX stages by two independent microinstruction flows, rather than
using a single microinstruction word to control both EX stages, greater flexibility in performing the instruction is
provided, thereby increasing performance. Further, the additional hardware which would be necessary for
single microinstruction control of the two EX stages is avoided.

In particular, some exclusive instructions can benefit from the use of both the EXx and EXy stages. As well
as using both EX stages, the exclusive instruction has access to both AC stages for address calculations - in
such a case, the AC stage is also microinstruction controlled.

While both EX (and AC) stages are being used to execute a single instruction, the respective EX stages
continue to receive two independent flows of microinstructions from the microsequencer. The operation of the
two EX units is coordinated by proper coding of the microinstructions.

8. Register Translation Unit 

8.1. Register Translation Overview

Referring to Figures 1a and 1b, the register translation unit 25a is used for instruction level data hazard
detection and resolution. Before completing execution in the EX pipe stage, each instruction must have its
source operands valid. The register translation unit is used to track each of the registers to determine if an
active instruction has an outstanding write (a "write pending").

If an instruction has a source register with a write pending, a residual control word (see Section 9 and
Figures 15a-b and 16a-b) associated with the instruction is marked at the AC1 stage to indicate that the source
register has a write pending. As the instruction progresses through the pipeline, each stage "snoops" the
write-back bus to detect a write to the dependent register. If a write to the dependent register is detected, the
write pending field in the residual control word associated with the source register is cleared.
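
The snooping behaviour described above can be sketched in a few lines of Python. This is only an illustrative model, not the circuitry of the embodiment: the class name `ResidualControlWord`, the `src_pending` dictionary and the method names are assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ResidualControlWord:
    # Hypothetical layout: source physical register ID -> write-pending flag.
    src_pending: dict = field(default_factory=dict)

    def snoop(self, written_phys_id: int) -> None:
        """Clear the write-pending flag when the write-back bus writes a source register."""
        if written_phys_id in self.src_pending:
            self.src_pending[written_phys_id] = False

    def sources_valid(self) -> bool:
        """True when no source register still has a write pending."""
        return not any(self.src_pending.values())


# Source register 3 was marked write-pending at AC1; register 7 was not.
rcw = ResidualControlWord(src_pending={3: True, 7: False})
rcw.snoop(5)                      # unrelated write: nothing changes
stalled_before = not rcw.sources_valid()
rcw.snoop(3)                      # write to the dependent register clears the flag
```

In this sketch each pipe stage would call `snoop` once per write observed on the write-back bus, and the instruction may complete in EX only when `sources_valid()` returns true.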

Figure 9 illustrates a general block diagram of the register translation unit 25a. The Physical Register File
(24 in Figure 1a) includes thirty-two physical registers for storing information directed to the eight logical
registers of the X86 architecture. Access to the physical registers is controlled by the register translation unit 25a.
State information relating to the physical and logical registers is stored in translation control registers 236.
Translation control circuitry 238 manages access to the physical registers based on the state information.

A true data dependency arises from a RAW (read-after-write) hazard, which prevents the instruction from
completing. There are also dependencies corresponding to a WAR (write-after-read) hazard, called an
antidependency, and a WAW (write-after-write) hazard, called an output dependency. Antidependencies and output
dependencies, which are not true data dependencies, can be removed through the use of register renaming, which is
controlled by the register translation unit 25a. In register renaming, more physical registers are provided than the
architecture defines (logically or architecturally). By assigning a new physical register each time a logical register
is to be written (destination of result), the register is renamed, which eliminates both WAR and WAW hazards.

The X86 architecture defines 8 general purpose programmer visible registers (EAX, EBX, ECX, EDX, EDI,
ESI, EBP, ESP). In the illustrated embodiment, there are 32 physical registers which will be used to map the
eight general purpose registers (logical registers). Since the microprocessor will predict and execute instructions
before a conditional branch has completed execution, the register translation unit must be able to handle
the consequences of a mispredicted branch. If the prediction is incorrect, the microprocessor must restore the
state back to the point of the conditional branch. As described below, checkpointing is used to save state
information before the speculative path is taken. Recovery from an incorrectly predicted conditional branch
involves reverting to the checkpointed physical registers.

For each AC1 pipe stage, the following operations are completed by the register translating and renaming
hardware.

1. Allocate (rename) up to two new registers which are destinations of the current instructions in the AC
pipe stage. The allocation will proceed in program order due to dependencies created if both instructions
specify the same register as destinations.

2. Check for RAW dependencies for instructions in the AC pipe stage.

3. Check physical register IDs on the write-back bus for registers used during AC for address calculations,
to enable bypassing and clearing of the write pending bit in the register translation unit.

4. Logical to physical translations for up to four registers.

8.2. Translation Control Registers 

Figure 10 illustrates the translation control registers 236. A Logical ID register 240 maps logical registers
to physical registers. The Size register 242 stores a code corresponding to the size of the logical register to
which the physical register is assigned. This aspect is discussed in greater detail below. The Current register
244 indicates the registers which are the most recently assigned for a given logical register. Thus, every time
a new physical register is allocated, the current bit for the physical register which previously was the current
register for the corresponding logical register is turned off and the current bit for the newly allocated register
is turned on. Consequently, at any time, the Current register 244 has eight bits on and twenty-four bits off.
For each physical register, the Pending register 246 has a bit which indicates whether a write to that physical
register is pending.
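
The invariant that the Current register always has exactly eight bits on can be illustrated with a small Python sketch. The reset mapping (logical register i held in physical register i) and the names `logical_id`, `current` and `make_current` are assumptions made for the example; the text does not specify them.

```python
NUM_PHYSICAL = 32
NUM_LOGICAL = 8

# Assumed reset state: logical register i is held in physical register i.
logical_id = list(range(NUM_LOGICAL)) + [None] * (NUM_PHYSICAL - NUM_LOGICAL)
current = [p < NUM_LOGICAL for p in range(NUM_PHYSICAL)]


def make_current(new_phys: int, logical: int) -> None:
    """Turn off the old current bit for `logical`, then turn on the bit for `new_phys`."""
    for p in range(NUM_PHYSICAL):
        if current[p] and logical_id[p] == logical:
            current[p] = False
    logical_id[new_phys] = logical
    current[new_phys] = True


make_current(new_phys=8, logical=0)   # first renaming of logical register 0
make_current(new_phys=9, logical=0)   # second renaming of the same logical register
```

After any number of allocations, `sum(current)` remains 8, matching the "eight bits on and twenty-four bits off" property stated above.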

Four checkpoint registers 248, Chkpnt0 - Chkpnt3, are used to store a copy of the Current register 244
each time a checkpoint occurs. In the preferred embodiment, checkpoints occur whenever a conditional branch
or a floating point operation passes through AC1. The checkpoint registers 248 are written to on a rotating basis.
Exception Restore registers 250 store the current bits for each instruction in AC1, AC2 and EX, as they existed
before the allocation for the instruction was made in the AC1 stage. The contents of the Exception Restore
registers follow the instructions as they move from stage to stage.

8.3. Register Allocation

For each instruction which writes results to a logical register, a new physical register is allocated by the
register translation unit 25a. The register allocation process first identifies a "free" physical register, i.e. a
register which is not in use. Detection of free registers is discussed in connection with Figure 11. Once a free
register is located, the logical register number is placed in the physical register data structure and is marked
current. The previous physical register which represented the logical register has its current bit cleared.

Circuitry for identifying a free register is shown in Figures 10-11. A Register Busy register 252 has one bit
location for each physical register. Each bit of the Register Busy register is set responsive to corresponding
locations in the Pending, Current, Checkpoint and Exception Restore registers. As shown in Figure 11, bit n
of the Register Busy register 252 is the result of a logical OR operation on the nth bit of the Pending, Current,
Checkpoint and Exception Restore registers. A register is free if its corresponding bit in the Register Busy
register is set to "0" and is in use if the corresponding bit is set to "1".
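
The free-register test above reduces to a bitwise OR. The following Python sketch models each control register as a 32-bit integer; the function names and the example bit patterns are illustrative assumptions, and the lowest-numbered-first search order is a choice made for the example rather than something the text specifies.

```python
from functools import reduce

NUM_PHYSICAL = 32


def register_busy(pending, current, checkpoints, exception_restores):
    """Bit n is 1 if bit n is set in any of the contributing registers."""
    sources = [pending, current, *checkpoints, *exception_restores]
    return reduce(lambda a, b: a | b, sources, 0)


def first_free(busy):
    """Return the lowest-numbered physical register whose busy bit is 0, or None."""
    for n in range(NUM_PHYSICAL):
        if not (busy >> n) & 1:
            return n
    return None


busy = register_busy(
    pending=0b0000_0100,
    current=0b0000_0011,
    checkpoints=[0b0000_1000, 0, 0, 0],          # four checkpoint registers
    exception_restores=[0b0001_0000] + [0] * 5,  # one per AC1/AC2/EX stage, X and Y
)
```

Here registers 0 through 4 are busy for different reasons, so `first_free(busy)` selects register 5.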

Upon allocation, the corresponding bit of the Current register is set to "1" to mark the physical register as
the current register. A code is placed in the corresponding three bits of the Logical ID register 240 to indicate
the logical register to which the physical register is assigned, and the corresponding bits of the Size register
are set to the size of the logical register being allocated (see Table 1 below). The pending bit corresponding
to the physical register is also set. The instruction causing the allocation will write to the assigned physical
register and any reads by subsequent instructions from the logical register will result in reads from this new
physical register. This renaming will occur during the AC1 pipe stage and will be processed in program order.
Processing the instructions in program order is required for the case where both instructions in AC1x and AC1y
specify the same logical register as a source and destination. As an example, this can occur if both instructions
are an ADD and the AX register is defined as both a source and destination. Through register renaming, two
new physical registers will be allocated for the logical AX register, with the last one being marked as the current
one. The example below shows how each instruction is renamed.

First instruction: (ADD AX,BX). Assume the physical register IDs for the AX and BX registers are currently
"1" and "2", respectively, when the ADD instruction is received in AC1. Since the AX register is also the
destination, a new physical register will be allocated for AX. This physical register will have an ID of "3"
(assuming that physical register "3" is free). The ADD instruction would then add physical registers "1" and
"2" and write the results into register "3".

AX (physical register 1) + BX (physical register 2) -> AX (physical register 3)

Second instruction: (ADD AX,BX). Since the AX register is a destination, a new physical register will be
allocated for AX. This will have the ID of "4". Since the previous instruction renamed the AX register to the
physical "3", it will be used as the AX source for the ADD, since it is marked as current as of the time of
the allocation. Therefore, the second ADD instruction would add physical registers "3" and "2" and write
the results into register "4".

AX (physical register 3) + BX (physical register 2) -> AX (physical register 4)
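
The two-ADD example can be reproduced with a short Python sketch of in-order renaming. The `Renamer` class, its explicit free list and its method names are assumptions introduced for illustration; the sketch also never returns registers to the free list, which the actual hardware handles through the Register Busy logic.

```python
from collections import deque


class Renamer:
    """Minimal in-order renaming model: logical name -> current physical ID."""

    def __init__(self, mapping, free):
        self.current = dict(mapping)
        self.free = deque(free)      # assumed free physical registers, in allocation order

    def rename(self, dest, src):
        # Sources are read from the mapping BEFORE the destination is re-allocated,
        # which is why the instructions must be processed in program order.
        sources = (self.current[dest], self.current[src])
        new_phys = self.free.popleft()
        self.current[dest] = new_phys
        return sources, new_phys


rat = Renamer({"AX": 1, "BX": 2}, free=[3, 4, 5])
first = rat.rename("AX", "BX")    # ADD AX,BX
second = rat.rename("AX", "BX")   # ADD AX,BX
```

The first call reads physical registers 1 and 2 and writes 3; the second reads 3 and 2 and writes 4, leaving physical register 4 as the current AX.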
Since the X86 architecture allows certain registers to be addressed as words (e.g. "AX"), low bytes (e.g.
"AL"), high bytes (e.g. "AH") or double words (e.g. "EAX"), a size is specified for each allocation, based on
how the register was specified by the instruction. The possible allocatable portions of a register are shown in
Figure 12a for the EAX register. Each physical register has a corresponding two-bit field in the Size register
which stores the code. Exemplary codes are shown in Table 1.

TABLE 1

Codes for Size Register

Code    Size           Example
00      word           AX
01      low byte       AL
10      high byte      AH
11      double word    EAX


A method for register translation using variable size registers is shown in Figure 12b. The translation
control circuitry in the register translation unit (25a in Figure 1a) compares the size of the logical register to
be allocated with the size of the current register for that logical register and determines whether the register
may be allocated or whether the instruction must be stalled.

A request for allocation is received (258), and the size of the register to be allocated is compared (260 and
262) with the size of the corresponding current register. If two instructions specify the same logical destination
register but as different sizes (e.g., AH and AL), where the logical destination of the second instruction in
program order does not fully include the portion of the logical register allocated to the first instruction, a RAW
dependency based on size is created. Accordingly, a register cannot be allocated until this dependency has
been resolved (264).

If the size of the logical register with a pending write of an instruction encompasses the portion of the logical
register specified by an earlier instruction (as defined in Table 2 below, using the EAX register as an example),
the new register can be allocated (266).

TABLE 2

Register Sizes which Allow Allocation for Registers with Size Dependencies

Size of Register with Pending Write    Allowable Sizes for Allocating New Registers
AL                                     AL, AX, EAX
AH                                     AX, EAX
AX                                     AX, EAX
EAX                                    EAX
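
The allocate-or-stall decision of Table 2 can be expressed as a simple lookup. The Python sketch below transcribes the table rows and the Table 1 size codes as printed; the names `ALLOWED`, `SIZE_CODE` and `can_allocate` are hypothetical and serve only to illustrate the comparison made by the translation control circuitry.

```python
# Two-bit size codes from Table 1.
SIZE_CODE = {"AX": 0b00, "AL": 0b01, "AH": 0b10, "EAX": 0b11}

# Rows of Table 2, transcribed as printed:
# size of register with pending write -> sizes that may be newly allocated.
ALLOWED = {
    "AL": {"AL", "AX", "EAX"},
    "AH": {"AX", "EAX"},
    "AX": {"AX", "EAX"},
    "EAX": {"EAX"},
}


def can_allocate(pending_size: str, requested_size: str) -> bool:
    """True if a new register may be allocated; False means the instruction stalls."""
    return requested_size in ALLOWED[pending_size]


ok = can_allocate("AL", "AX")         # AX fully includes AL: allocate
stall = not can_allocate("AL", "AH")  # AH does not include AL: stall until resolved
```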



8.4. Instructions With Two Destinations

The majority of X86 instructions specify only one register destination. There are a few which specify two
register destinations (e.g., XCHG AX,BX). So as not to complicate the register translation unit hardware, only
one destination for an instruction can be renamed each clock. Therefore, a special case for the instructions
which specify two destinations is used. These instructions, while in the AC1 pipe stage, will stall any other
instruction from using the register translation hardware for one clock, so the second destination can be
renamed.

8.5. Checkpointing Registers for Speculative Branch Execution

Referring to Figure 10, the microprocessor will predict the direction of a branch (conditional change of flow),
and begin executing instructions in the predicted direction, before the branch has been resolved. If the branch
is mispredicted, the microprocessor must restore the processor state back to the point of the branch.

The register translation unit (25a in Figure 1a) allows the microprocessor to save the processor state at
the boundary of a branch by checkpointing the registers - copying the Current register 244 to one of the
Checkpoint registers 248 - before speculatively executing instructions in the predicted direction of the branch. The
Checkpoint registers 248 are written to in a rotating order.

In the preferred embodiment, the registers are also checkpointed for floating point operations.

Since checkpointing allows the microprocessor to return to the state defined by the checkpoint registers,
it could be used for every instruction. However, resources must be provided for each checkpoint, and therefore,
there is a trade-off between the functionality of checkpointing and the hardware resources to be allocated to
checkpointing. In the illustrated embodiment, four checkpoint registers are used, allowing up to four
checkpoints at any one time.

Recovery from an incorrectly predicted branch (or a floating point error) involves reverting to the
checkpointed physical registers. When a branch enters the AC stage of the pipeline, the Current register 244 is copied
to one of the Checkpoint registers 248. While executing instructions in the predicted direction, new registers
will be allocated. When a new register is allocated, the physical register that is marked current will clear its
current bit, as it normally would. If the predicted direction is incorrect, then the Checkpoint register 248
associated with the branch is copied to the Current register, which will restore the state of the physical registers to
the state which existed immediately prior to the branch. Hence, the microprocessor may recover from a
mispredicted branch or a floating point error in a single clock cycle.
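
Because a checkpoint is only a copy of the 32-bit Current register, recovery is a single register copy. The Python sketch below models this under stated assumptions: the class `CheckpointFile` and its method names are hypothetical, and the Current register is represented as an integer bit mask.

```python
class CheckpointFile:
    """Current register plus rotating checkpoint slots (four in the illustrated embodiment)."""

    def __init__(self, current, slots=4):
        self.current = current            # bit n set: physical register n is current
        self.slots = [0] * slots
        self.next_slot = 0

    def checkpoint(self):
        """Copy Current into the next slot, rotating; return the slot used."""
        slot = self.next_slot
        self.slots[slot] = self.current
        self.next_slot = (slot + 1) % len(self.slots)
        return slot

    def allocate(self, old_phys, new_phys):
        """Speculative renaming: clear the old current bit, set the new one."""
        self.current &= ~(1 << old_phys)
        self.current |= 1 << new_phys

    def recover(self, slot):
        """Misprediction: restore Current from the branch's checkpoint."""
        self.current = self.slots[slot]


rtu = CheckpointFile(current=0b1111_1111)
before = rtu.current
slot = rtu.checkpoint()               # branch enters AC1
rtu.allocate(old_phys=0, new_phys=8)  # speculative instruction renames a register
speculative = rtu.current
rtu.recover(slot)                     # branch was mispredicted
```

Note that in this model the speculatively allocated register 8 becomes free again simply because its current bit is no longer set after recovery.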

8.6. Recovery from Exceptions 

Referring to Figure 10, recovery from exceptions is similar to recovery from a mispredicted branch. If an
exception occurs within a given stage (AC1x, AC1y, AC2x, AC2y, EXx, EXy), the Exception register 250
associated with that stage is copied into the Current register. Since the Exception register for a given stage contains
a copy of the Current register 244 as it existed prior to the allocation (which occurred in the AC1 stage) for the
present instruction in the stage, copying the associated Exception register 250 to the Current register 244 will
reset the association of the physical registers to the logical registers to that which existed before the
exception-causing instruction entered AC1. Thus, the present invention allows the state of the machine to be modified,
even though the instruction modifying the state may later cause an exception.

To determine which Exception register should be used to restore the Current register 244, the register
translation unit 25a uses information from the pipe control unit (28 in Figure 1a). When an exception occurs,
the pipe control unit will flush stages of the pipelines. Using signals from the pipe control unit which indicate
which stages were flushed, and which stages were valid at the time of the flush, along with the XFIRST bit for
each stage, the register translation unit will determine the most senior flushed stage. The Exception register
corresponding to this stage is copied into the Current register 244.

8.7. Microcontrol of the Register Translation Unit

Referring to Figure 1a, the register translation unit 25a is normally controlled via signals produced by the
pipeline hardware. In certain instances, however, it is beneficial to control the register translation unit 25a
through microcode signals generated by the microsequencer in microcontroller 26 as part of an instruction.
For example, exclusive instructions will require access to the register translation unit hardware to determine
which physical register is mapped to a logical register. Instructions such as PUSHA (push all) require a logical
to physical translation of all eight logical registers during their execution.

To efficiently accommodate the need to access the register translation unit by exclusive instructions,
control signals are multiplexed into the register translation unit 25a through multiplexers controlled by the
microcode, as shown in Figure 13. Control signals generated by the hardware and by the microcode (via the
microsequencer) are input to a multiplexer 260. The multiplexer passes one of the control signals based on the value
of a Microcode Select signal which controls the multiplexer 260. The Microcode Select signal is generated by
the microcode. Hence, if the microcode associated with an instruction needs the register translation unit 25a,
one of the microinstruction bits enables the multiplexers 260 to pass the microcode control signals rather than
the signals from the pipeline hardware. Other bits of the microinstruction(s) act as the control signals to the
register translation unit 25a to enable the desired function. Instructions which do not need the register
translation unit for their execution will enable the multiplexers to pass only the control signals generated by the
hardware.


8.8. Register ID Translation and Hazard Detection 

Responsive to a request for a logical register, the register translation unit (25a in Figure 1a) will supply the
identification of the current physical register mapped to the requested logical register. Also, the register
translation unit will output eight bits, one for each logical register, indicating whether the current physical register
for the associated logical register has a write pending. These bits are used to detect RAW hazards.

In the preferred embodiment, the register translation unit is formed from a plurality of cells, each cell
representing one physical register. Figure 14a illustrates a schematic representation of one cell 270 as it relates
to register ID translation and hazard detection. In response to a 3-bit code representing one of the eight logical
registers placed on the trans_id bus, a 5-bit code representing the current physical register for the specified
logical register will be placed on the phy_id bus. Each cell 270 receives the code from the trans_id bus. The
3-bit code on the trans_id bus is compared to the bits of the Logical ID register corresponding to that cell. In
the preferred embodiment, the bits of the control registers 240-252 are divided between the cells such that
each cell contains the bits of each register 240-252 corresponding to its associated physical register.

The Logical ID bits are compared to the 3-bit code by comparator 272. The match signal is enabled if the
3-bit code equals the Logical ID bits. The match signal and the Current bit for the cell are input to AND gate
274. Hence, if the physical register represented by the cell is associated with the specified logical register,
and if the physical register is marked as the current register for the specified logical register, the output of the
AND gate 274 will be a "1". The output of AND gate 274 enables a 5-bit tri-state buffer 276. If the output of the
AND gate is a "1", the buffer passes the physical ID associated with the cell to the phy_id bus. For a given
logical register ID, only one physical register will be current; therefore, only one cell will have its tri-state buffer
enabled.

The Logical ID bits are also input to a 3-to-8 decoder 278. Thus, one of the eight outputs of the decoder
278 will be enabled responsive to the logical register mapped to that cell. Each output of the decoder 278 is
coupled to the input of a respective AND gate 280 (individually denoted as AND gates 280a-280g). Each AND
gate 280 also receives the Current and Pending bits for the physical register associated with the cell. The output
of each AND gate 280 is coupled to a respective hazard bus associated with each logical register. For example,
AND gate 280a is coupled to the hazardEAX bus, which is associated with the EAX logical register. AND gate
280g is coupled to the hazardESP bus, which is associated with the ESP logical register.

For a given cell, at most one AND gate 280 will be enabled, if that cell represents the logical register
mapped to the physical register represented by that cell, and if the physical register is marked current with a write
pending. As shown in Figure 14b, the hazard buses perform a wired-OR on the outputs of each cell. For each
hazard bus, only one of the associated AND gates 280 will be enabled, since only one Current bit associated
with the logical register will be enabled. If the Pending bit associated with the current physical register is also
enabled, the corresponding AND gate 280 will be enabled and the hazard bus will indicate that there is a write
pending to that logical register. This information is used to detect RAW hazards.
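
The cell-based lookup and the wired-OR hazard buses can be modelled behaviourally as follows. This Python sketch is not a gate-level description: the `Cell` dataclass and function names are assumptions, and the 3-bit encoding of logical registers (EAX = 0 through ESP = 7, following the order in which the registers are listed in Section 8.1) is assumed, since the text does not give the encoding.

```python
from dataclasses import dataclass

# Assumed 3-bit encoding; the specification does not state the actual codes.
LOGICAL = ["EAX", "EBX", "ECX", "EDX", "EDI", "ESI", "EBP", "ESP"]


@dataclass
class Cell:
    phys_id: int      # 5-bit physical register ID of this cell
    logical_id: int   # 3-bit Logical ID bits held by this cell
    current: bool
    pending: bool


def translate(cells, trans_id):
    """Comparator 272 + AND gate 274: the single matching current cell drives phy_id."""
    drivers = [c.phys_id for c in cells if c.logical_id == trans_id and c.current]
    assert len(drivers) == 1, "exactly one cell may drive the phy_id bus"
    return drivers[0]


def hazard_buses(cells):
    """Decoder 278 + AND gates 280, wired-OR per logical register."""
    buses = [False] * len(LOGICAL)
    for c in cells:
        buses[c.logical_id] |= c.current and c.pending
    return buses


# Physical registers 0-7 initially hold logical registers 0-7.
cells = [Cell(i, i, True, False) for i in range(8)]
# EAX is renamed to physical register 8, with its write still pending.
cells[0].current = False
cells.append(Cell(8, 0, True, True))

eax_phys = translate(cells, LOGICAL.index("EAX"))
hazards = hazard_buses(cells)
```

Only the hazard bus for EAX is asserted in this state, so an instruction reading EAX would see a RAW hazard on physical register 8.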

9. Forwarding

As described above, a RAW dependency will cause the microprocessor to stall on the dependent instruction.
In the preferred embodiment, "forwarding" is used to eliminate RAW dependencies in certain situations
in order to increase instruction throughput. Forwarding modifies instruction data to eliminate RAW
dependencies between two instructions which are both in the EX stage at the same time.

Two types of forwarding are used in the preferred embodiment. "Operand forwarding" forwards, under
certain conditions, the source of a senior MOV (or similar) instruction to a junior instruction as source data for
that instruction. "Result forwarding" forwards, under certain conditions, the result of a senior instruction directly
to the destination of a subsequent MOV (or similar) instruction.

The following code illustrates operand forwarding:

1) MOV AX,BX

2) ADD AX,CX

Referring to Figures 15a-b, using operand forwarding, the junior ADD instruction will be effectively
modified to BX+CX->AX. Each instruction is associated with residual control information stored in a residual
control word which includes the sources (along with fields indicating whether there is a write pending to each
source) and destinations for the operation, among other control information (not shown). Thus, assuming that
physical register "0" is allocated to logical register BX and physical register "1" is allocated to logical destination
register AX, a "0" is stored in the SRC0 (source 0) field and a "1" is stored in the DES0 (destination 0) field of
the residual control word associated with the MOV instruction. Similarly, assuming that physical register "2"
is allocated to logical register CX, a "1" is stored in the SRC0 field of the residual control
word associated with the ADD instruction (since the destination register of the MOV instruction is one of the
sources for the ADD instruction), a "2" is stored in the SRC2 field and a "3" is stored in the DES0 field (since
register renaming will find a free register for the logical destination AX register).

As can be seen, a RAW dependency exists between the MOV and the ADD instruction, since the MOV
instruction must write to physical register "1" prior to execution of the ADD instruction. However, using operand
forwarding, this dependency can be eliminated. As shown in Figure 15b, operand forwarding does not affect
the MOV command. However, the residual control word of the ADD instruction is modified such that the SRC0
field points to physical register "0", which is associated with logical source register BX for the MOV.
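
Operand forwarding amounts to rewriting one source field of the junior instruction's residual control word. The Python sketch below uses the register numbers from the MOV/ADD example; the `RCW` dataclass, its field names `src0`, `src1` and `des0`, and the function `operand_forward` are simplifying assumptions and do not reproduce the actual field layout of Figures 15a-b.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RCW:
    """Simplified residual control word: physical IDs of sources and destination."""
    op: str
    src0: Optional[int]
    src1: Optional[int]
    des0: int


def operand_forward(senior: RCW, junior: RCW) -> RCW:
    """If the senior is a MOV, point junior sources that read the MOV's
    destination at the MOV's own source instead."""
    if senior.op != "MOV":
        return junior

    def fwd(src):
        return senior.src0 if src == senior.des0 else src

    return replace(junior, src0=fwd(junior.src0), src1=fwd(junior.src1))


mov = RCW("MOV", src0=0, src1=None, des0=1)   # MOV AX,BX : BX=phys 0, new AX=phys 1
add = RCW("ADD", src0=1, src1=2, des0=3)      # ADD AX,CX : AX=phys 1, CX=phys 2, new AX=phys 3
forwarded = operand_forward(mov, add)
```

After forwarding, the ADD reads physical registers 0 and 2 directly, so it no longer waits for the MOV to write physical register 1; the MOV itself is unchanged.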

Similarly, result forwarding modifies the residual control word of a junior MOV instruction with the result
of a senior instruction. To describe result forwarding, the following sequence is used:

1) ADD AX,BX

2) MOV CX,AX

Referring to Figures 16a-b, result forwarding modifies the MOV command such that the CX register is
loaded with the data generated as the result of the ADD instruction. Physical register "0" is allocated to logical
source register BX, physical register "1" is allocated to logical source register AX, physical register "2" is
allocated to logical destination register AX, and physical register "3" is allocated to logical destination register CX.
Thus, a RAW dependency exists between the two instructions, since the destination of the ADD instruction
(physical register 2) is the source of the MOV instruction.

After result forwarding (Figure 16b), the ADD instruction remains unchanged; however, the residual control
word associated with the MOV instruction is modified such that the destination register CX (physical register
3) receives its data from the write-back bus associated with the EX unit performing the ADD (shown in Figure
16b as the X-side write-back bus) at the same time AX is written. Consequently, the RAW dependency is
eliminated, and both the ADD and the MOV instructions may be executed simultaneously.
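
Result forwarding can be sketched in the same style: the junior MOV stops reading the senior instruction's destination register and instead takes its data from the senior's write-back bus. In the Python sketch below, the `RCW` dataclass, the `bus_src` field and the function `result_forward` are hypothetical names used for illustration; the actual encoding in the residual control word is not given in the text.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RCW:
    """Simplified residual control word."""
    op: str
    src0: Optional[int]
    src1: Optional[int]
    des0: int
    bus_src: Optional[str] = None   # hypothetical: write-back bus feeding des0 ("X" or "Y")


def result_forward(senior: RCW, junior: RCW, senior_bus: str) -> RCW:
    """If a junior MOV reads the senior's destination, load the MOV's
    destination from the senior's write-back bus instead."""
    if junior.op == "MOV" and junior.src0 == senior.des0:
        return replace(junior, src0=None, bus_src=senior_bus)
    return junior


add = RCW("ADD", src0=1, src1=0, des0=2)     # ADD AX,BX : AX=phys 1, BX=phys 0, new AX=phys 2
mov = RCW("MOV", src0=2, src1=None, des0=3)  # MOV CX,AX : AX=phys 2, new CX=phys 3
forwarded = result_forward(add, mov, senior_bus="X")
```

Here physical register 3 (CX) is written from the X-side write-back bus in the same cycle that physical register 2 (AX) is written, so the two instructions can occupy their EX stages together.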

Forwarding is used only under certain conditions. One of the instructions in the sequence must be a MOV
instruction or similar "non-working" instruction. A non-working instruction is one that transfers operand data
from one location to another, but does not perform substantive operations on the data. A working instruction
generates new data in response to operand data or modifies operand data. In the X86 instruction set, the
non-working instructions would include MOV, LEA, PUSH <reg>, and POP <reg>. Also, OR <reg1>,<reg1> and AND
<reg1>,<reg1> (where both the source and destination registers are the same) can be considered "non-working"
instructions because they are used only to set flags.

Further, in the preferred embodiment, forwarding is used only in cases where both instructions in the
sequence are in their respective EX units at the same clock cycle. Forwarding searches up to three instructions
(in program order) ahead of an instruction in the AC2 stage to determine whether a forwarding case can occur.
Even if the forwarding instruction is two instructions ahead, forwarding can occur if the forwarding instruction
delays in the EX stage long enough for the instruction in the AC2 stage to move to the EX stage.

As shown in Figure 17a, in the situation where instructions "1" and "2" are in the X- and Y-side EX units,
respectively, and instructions "3" and "4" are in the X- and Y-side AC2 units, instruction "4" looks at instructions
"3" and "1" to determine whether an operand or result forwarding situation is possible. Since instruction "4" is
still in the AC2 stage, it cannot forward with instruction "1" unless instruction "1" delays in the EX stage until
instruction "4" is issued into the Y-side EX stage. Similarly, if a forwarding situation is possible with instruction
"3", the forwarding will occur only if both "3" and "4" are issued to the respective EX stages such that they are
concurrently in the EX stage for at least one clock cycle.

Instruction "4" does not look to instruction "2" for a forwarding situation, since both instructions cannot
be concurrently in the EX unit given the architecture shown. Bypassing may be used to reduce the latency
period of RAW dependencies between instructions "4" and "2". With alternative pipeline configurations, such
as an architecture which allowed switching pipes at the AC2/EX boundary, it would be possible to forward
between instructions "4" and "2".

Figure 17b illustrates the conditions monitored for forwarding in connection with instruction "3" given the
initial conditions set forth in connection with Figure 17a. In this state, only instruction "2" is monitored for a
forwarding situation. Instruction "1" cannot forward with instruction "3" because they cannot concurrently be
in the EX stage. Instruction "3" cannot have a RAW dependency on instruction "4" because instruction "4" is
junior to instruction "3" (although, as shown in Figure 17a, instruction "4" can have a RAW dependency on
instruction "3").

A block diagram of the forwarding control circuitry is shown in Figure 18. The circuitry of the forwarding
control stage is associated with the AC2 stage. The forwarding control circuitry 300 includes operand monitor
and control circuitry 302 to monitor the source operands of the instructions in the AC2 pipe stage and the
source and destination operands of the instructions in the EX stage and to modify the residual control
information as described above. Further, once the possibility of a forwarding situation is detected, instruction
movement monitoring circuitry 304 of the forwarding control circuitry 300 monitors movements of the instructions
to detect the presence of both instructions in the respective EX units to implement forwarding. Control circuitry
306 coordinates the operand monitor and control circuitry 302 and instruction movement monitor circuitry 304.

In the preferred embodiment, the forwarding circuitry is part of the register file control found in the physical
register file (24 in Figure 1a). The register file control also maintains the residual control words.

While forwarding has been discussed in relation to a processor using two instruction pipelines, it could
be similarly used in connection with any number of pipelines. In this case, the forwarding control circuitry would
monitor the residual control words associated with instructions in the EX units of each of the pipelines at the
EX and AC2 stages.

Forwarding and register translation are independent of one another. In a given microprocessor, either or
both techniques can be used to increase instruction throughput.

10. Conclusion 

While the present invention has been described in connection with a specific embodiment of two pipelines
with specific stages, it should be noted that the invention, as defined by the claims, could be used in connection
with more than two pipelines and different stage configurations.

The pipe control unit disclosed herein provides an efficient flow of instructions through the pipeline, which
increases the rate at which the instructions are processed. Hence, a higher instruction throughput can be
achieved without resort to higher frequencies. Further, the register translation unit and forwarding eliminate many
dependencies, thereby reducing the need to stall instructions.

Although the Detailed Description of the invention has been directed to certain exemplary embodiments,
various modifications of these embodiments, as well as alternative embodiments, will be suggested to those
skilled in the art. For example, while various methods and circuits for pipeline control have been illustrated in
conjunction with one another, independent use of one or more of the various methods and circuits will generally
lead to beneficial results.

The invention encompasses any modifications to the described embodiments or alternative embodiments
that fall within the scope thereof.

Claims

1. A superscalar, pipelined processor with a plurality of execution pipelines each with a plurality of stages
in which instructions issued into a pipeline are processed, thereby altering processor state, comprising:

instruction issue means for issuing instructions into multiple pipelines without regard to data
dependencies between such issued instructions; and

pipeline control means for monitoring data dependencies between instructions in multiple
pipelines;

the pipeline control means controlling the flow of instructions through the stages of the pipelines
such that a first instruction in a current stage of one pipeline is not delayed due to a data dependency on
a second instruction in another pipeline unless such data dependency must be resolved for proper
processing of the first instruction in such current stage.

2. The processor of Claim 1 wherein the data dependencies to be resolved include read-after-write
dependencies in which a first instruction in a first pipeline must write results to a register before a second
instruction in a second pipeline can read such result data from that register.

3. The processor of Claim 1 or 2, further comprising register translation means, including register renaming 
means for eliminating write-after-write and write-after-read data dependencies using register renaming.

4. The processor of Claims 1-3, further comprising data forwarding means for eliminating, for selected
instructions, data dependencies by forwarding an operand or a result from a senior instruction directly to
a junior instruction. 

5. A superscalar, pipelined processor with a plurality of execution pipelines each with a plurality
of stages in which instructions issued into a pipeline are processed, thereby altering processor state,
comprising:

instruction issue means for issuing instructions into the pipelines; and

pipe control means for monitoring the relative sequence of instructions in the pipelines, and, for
each stage of each pipeline in which an instruction can cause an exception, monitoring the exception
status of an instruction;

the pipe control means controlling instruction flow in the pipelines such that, for a given stage, a junior
instruction cannot modify the processor state before a senior instruction until after the senior instruction
can no longer cause an exception.

6. A superscalar, pipelined processor with a plurality of execution pipelines each with a plurality of stages 
in which instructions issued into a pipeline are processed, thereby altering processor state, comprising: 

instruction issue means for issuing instructions into the pipelines; and

pipe control means, including pipe switching means for selectively switching instructions between 
pipelines; 

the pipe switching means ordering instructions in the pipelines so as to reduce dependencies be- 
tween instructions. 

7. A superscalar, pipelined processor with a plurality of execution pipelines each with a plurality of stages, 
including an execution stage, in which instructions issued into a pipeline are processed, thereby altering
processor state, comprising: 

instruction issuing means for issuing instructions into the pipelines; 

for selected instructions, the instruction issue means issuing a single instruction into at least two
pipelines;

microcontrol means for providing independent microinstruction flows to respective execution
stages of the pipelines;

for each such selected instruction, the microcontrol means selectively controlling the flow of
microinstructions to the execution stages of the at least two pipelines such that these execution stages are
independently controlled to process the selected instruction.

8. A superscalar, pipelined processor with a plurality of execution pipelines each with a plurality of stages 
in which instructions issued into a pipeline are processed, thereby altering processor state, comprising:

instruction issue means for issuing instructions into the pipelines; and 

pipe control means for monitoring, for each pipeline, state information for each stage; 
for each pipeline, the pipe control means controlling the flow of instructions between stages
responsive to processor state information, such that an instruction is allowed to advance from one stage
to another independent of the movement of other instructions in corresponding stages of other pipelines.

9. A superscalar, superpipelined processor with a plurality of execution pipelines each with a plurality of
stages in which instructions issued into a pipeline are processed, thereby altering processor state,
comprising:

instruction issue means for issuing instructions into the pipelines, including first and second in- 
structions; 

pipe control means for detecting any dependencies between the first and second instructions; and

register translation means responsive to detection of an instruction dependency by the pipe control
means for modifying an operand source for one of the first and second instructions to eliminate a
dependency between such instructions.

10. A pipelined processor having an execution pipeline with a plurality of stages in which instructions issued
into a pipeline are processed, thereby altering processor state, the instructions referencing a defined set
of logical registers, comprising:

a register file including a plurality of physical registers in excess of the number of logical registers; 
register translation means for allocating physical registers to logical registers, and for each physical 
register, storing an indication of whether such physical register holds the current value of the correspond- 
ing logical register; 

in response to an instruction that writes a result to a destination logical register, the register
translation means allocating a new physical register to such destination logical register, and designating
the new physical register as current for such destination logical register;

checkpointing means for checkpointing state information defining a set of physical registers that 
have been most recently allocated to each of the logical registers prior to allocating a physical register, 
and associating such checkpointed state information with a current stage for the instruction causing the 
allocation; and 

exception handling means for restoring such checkpointed state information responsive to an
exception caused by the instruction causing the allocation.

11. A pipelined processor having an execution pipeline with a plurality of stages in which instructions issued 
into a pipeline are processed, thereby altering processor state, the instructions referencing a defined set 
of logical registers having multiple addressable sizes as sources and destinations of operands for the
instruction, comprising:

a register file including a plurality of physical registers in excess of the number of logical registers;
and

register translation means for selectively allocating one of said physical registers to one of said
logical registers responsive to an instruction for writing to said one of said logical registers and the size
associated with the logical register.

12. A pipelined processor having an execution pipeline with a plurality of stages in which instructions issued
into a pipeline are processed, thereby altering processor state, the instructions referencing a defined set 
of logical registers having multiple addressable sizes as sources and destinations of operands for the in- 
struction, comprising: 

a register file with a plurality of physical registers in excess of the number of logical registers; 
scoreboard means for associating with each physical register an indication of whether it is a current
register, and a logical ID code identifying the logical register to which the physical register is allocated;
and

register translation means for receiving a logical ID code for a requested logical register and
comparing the received logical ID code to the logical ID code stored in the second memory;

the register translation means outputting a physical ID code if said received logical ID code
corresponds to the logical ID code stored in said second memory and if said first memory indicates that the
physical register is a current register.

13. A pipelined processor having an execution pipeline with a plurality of stages in which instructions issued
into a pipeline are processed, thereby altering processor state, the instructions referencing a defined set
of logical registers having multiple addressable sizes as sources and destinations of operands for the
instruction, comprising:

a register file with a plurality of physical registers in excess of the number of logical registers;
register translation means for generating signals for respective logical registers indicating whether 
a data dependency exists for that logical register. 

14. A pipelined processor having an execution pipeline with a plurality of stages in which instructions issued
into a pipeline are processed, thereby altering processor state, the instructions referencing a defined set
of logical registers having multiple addressable sizes as sources and destinations of operands for the
instruction, comprising:

microcontrol means for providing, for each instruction, a sequence of microinstructions controlling
execution of the instructions in an execution stage of the execution pipeline;

a register file with a plurality of physical registers for storing information associated with the logical 
registers; and 

register translation means for allocating physical registers to the logical registers; and 
for selected instructions, the register translation means being enabled to be controlled by
microinstructions from the microcontrol means.
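
By way of illustration only, the register translation recited in Claims 10 and 12 can be sketched in software. The sketch is a simplified model under stated assumptions, not the claimed circuit: the class name `RegisterTranslationSketch` and its methods are hypothetical, each physical register carries only a logical ID and a current flag, operand sizes are ignored, and non-current registers are never retired or reused.

```python
from typing import List, Optional, Sequence, Tuple

Checkpoint = Tuple[List[bool], List[bool]]  # (current flags, allocated flags)


class RegisterTranslationSketch:
    """Hypothetical model: per physical register, a logical ID and a current flag."""

    def __init__(self, logical_names: Sequence[str], num_physical: int = 32) -> None:
        if num_physical <= len(logical_names):
            raise ValueError("need more physical than logical registers")
        self.logical_id: List[Optional[str]] = [None] * num_physical
        self.current: List[bool] = [False] * num_physical
        self.allocated: List[bool] = [False] * num_physical
        for name in logical_names:
            self._allocate(name)

    def _allocate(self, name: str) -> int:
        # Raises ValueError when no physical register is free
        # (this sketch has no retirement path).
        phys = self.allocated.index(False)
        # The previous holder of this logical register stops being current.
        for p, lid in enumerate(self.logical_id):
            if lid == name:
                self.current[p] = False
        self.logical_id[phys] = name
        self.current[phys] = True
        self.allocated[phys] = True
        return phys

    def lookup(self, name: str) -> Optional[int]:
        """Return the physical ID whose logical ID matches and whose current flag is set."""
        for p, lid in enumerate(self.logical_id):
            if lid == name and self.current[p]:
                return p
        return None

    def write(self, name: str) -> Tuple[int, Checkpoint]:
        """Checkpoint the flags, then allocate a new physical register for a write."""
        saved = (list(self.current), list(self.allocated))
        return self._allocate(name), saved

    def restore(self, saved: Checkpoint) -> None:
        """Undo allocations made after the checkpoint, as after an exception."""
        self.current, self.allocated = list(saved[0]), list(saved[1])


if __name__ == "__main__":
    rtu = RegisterTranslationSketch(["AX", "BX"], num_physical=4)
    before = rtu.lookup("AX")
    new, saved = rtu.write("AX")
    after = rtu.lookup("AX")
    rtu.restore(saved)
    print(before, after, rtu.lookup("AX"))
```

The checkpoint here is taken before the allocation, so restoring it makes the previously current physical register current again and returns the newly allocated register to the free pool.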